I read something about hyper-threading and how beneficial it can be for performance. As I understand it, the idea is that when two independent processes are scheduled on one core, hyper-threading lets them execute at the same time. I also know that hyper-threading works by splitting a core into two virtual (logical) ones.
But imagine this situation. I have a basic Python script doing some maths. It is a single-threaded process and can push a core to 100% for a long time. It always does the same kind of work, so it cannot benefit from hyper-threading's main feature of executing two things at the same time.
What will be the difference in performance with HT on versus off?
With HT on, will I get only 50-60% of the process's performance, or close to 100%?
My example: I have 2 cores (4 threads), and I'm running one program flat out. Windows Task Manager shows that process using around 32-33% CPU, while Process Explorer shows it capped at 25%. So if HT were off, it would presumably show 50%. I'm now not sure whether I'm really using only half of a core's power, or whether the numbers are just misleading, or what else is going on. I need some explanation.
Python (CPython) threads cannot take advantage of processor parallelism, except in some special cases (NumPy, gzip and other C extensions that release the lock). For more information, read about the GIL (Global Interpreter Lock).
HT will not change anything unless your system has other things running as well.
PS: You can test and measure this!
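For instance, a rough way to measure it (a minimal sketch; the busy loop is just a stand-in for your maths script):

import time
from threading import Thread

def burn(n=10000000):
    # pure-Python busy loop: CPU bound and holds the GIL the whole time
    s = 0
    for i in range(n):
        s += i * i
    return s

start = time.time()
burn(); burn()
print("serial:    %.2f s" % (time.time() - start))

start = time.time()
threads = [Thread(target=burn) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print("2 threads: %.2f s" % (time.time() - start))

On CPython the two-thread run usually takes about as long as the serial one for this kind of work, because the GIL lets only one thread execute Python bytecode at a time; turning HT on or off does not change that picture for a single busy process.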
I'm working on a project where each time you double the number of threads, you add between 40% and 60% overhead. Since hyper-threading increases performance by at most about 30%, this means the program runs slower than in single-threaded mode on hyper-threaded systems.
The first steps seem simple (roughly sketched after this list).
Get the number of threads available to the process through len(os.sched_getaffinity(0)).
Restrict the number of threads through z3 parameters.
Bind the threads to physical cores using os.sched_setaffinity(0, mask).
Leave SMT (hyper-threading) enabled on systems where platform.machine() does not contain Intel or AMD.
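Roughly, the affinity-related steps might look like this on Linux (a sketch only: os.sched_getaffinity and os.sched_setaffinity exist only on Linux, and treating every second CPU number as a distinct physical core is just a naive placeholder, which is exactly the problem described below):

import os

logical_cpus = os.sched_getaffinity(0)     # set of CPU numbers we may run on
n_threads = len(logical_cpus)              # thread count to hand to z3

# Bind the process to a subset of CPUs. Which numbers are SMT siblings of
# which physical cores is the open question; every second CPU is a guess.
mask = set(sorted(logical_cpus)[::2])
os.sched_setaffinity(0, mask)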
However, several problems arise in doing this.
How do I know whether the system has hyper-threading enabled?
Before using os.sched_setaffinity(0, mask), how do I know which CPU numbers belong to physical cores and which are logical (SMT) siblings?
The problem is that the program currently supports a wide range of platforms through Python 3: all Unixes, as well as Windows, macOS and OpenVMS, not forgetting PyPy.
Any patch to fix the problem shouldn't spawn a new process, add an external dependency, or drop support for any of the platforms above.
What would be a clean way to fix this?
The loky library contains a fairly portable solution to this. It does spawn a process, but it then caches the result, so it's not as if you're spawning a process more than once. Given that this is the solution backing popular libraries like sklearn, I would guess it's about as good as it gets.
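For instance (assuming a reasonably recent loky; as far as I know the only_physical_cores argument is what makes the physical-versus-logical distinction):

from loky import cpu_count

n_logical = cpu_count()                           # logical CPUs, SMT included
n_physical = cpu_count(only_physical_cores=True)  # physical cores only
smt_active = n_logical > n_physical               # crude hyper-threading check
print(n_logical, n_physical, smt_active)

As far as I know it reads OS-specific sources such as /proc/cpuinfo where it can, and otherwise runs a small helper once and caches the answer, which is the spawn-then-cache behaviour mentioned above.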
I have a Python script that calls HandBrakeCli as a subprocess on Linux. During normal operation, HandBrakeCli takes almost 100% of the available CPU resources. But sometimes it seems to starve: it appears inactive, yet does not return for a long time. This seems to be caused by a broken DVD (HandBrake is a DVD ripping tool), which HandBrake apparently does not handle well. It will finish after a long while (hours), but before that it does essentially nothing, whereas in normal operation it uses lots of CPU cycles.
So my idea was to have a watchdog that checks whether a HandBrakeCli process exists and, if so, monitors whether it is using a decent share of the CPU. If it is not for, say, 5 minutes, the watchdog kills the process so that the parent script can continue its operations.
It does not seem too hard to program this, but I have a feeling it could involve some tedium. It also seems quite likely that this problem has been solved before. Is there an existing solution, preferably in Python? A standalone program or daemon that does the job on Linux would also be fine even if it's not Python, as long as it is open source.
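For concreteness, the watchdog loop described above might look roughly like this (a sketch only; it assumes the third-party psutil package and that the process name matches exactly):

import time
import psutil

CPU_THRESHOLD = 5.0      # percent; below this the process counts as idle
IDLE_LIMIT = 5 * 60      # seconds of continuous idleness before killing

idle_since = None
while True:
    procs = [p for p in psutil.process_iter(['name'])
             if p.info['name'] == 'HandBrakeCLI']   # adjust to the exact process name
    if not procs:
        break                                   # nothing left to watch
    proc = procs[0]
    try:
        usage = proc.cpu_percent(interval=10)   # sample CPU use over 10 seconds
    except psutil.NoSuchProcess:
        break
    if usage < CPU_THRESHOLD:
        idle_since = idle_since or time.time()
        if time.time() - idle_since > IDLE_LIMIT:
            proc.kill()                         # give up on the stuck rip
            break
    else:
        idle_since = None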
Well, perhaps the most obvious answer for something to monitor and control child processes in Python would have to be supervisord. It's very mature, and has a lot of features and documentation.
It is written in Python, and has a plugin API, as well as a service API and a web dashboard.
Your problem description is a little vague, though, regarding 'starvation'. It does sound as if your job is blocking on some resource, but you might want to diagnose this a little more precisely before reaching for complicated software scaffolding to prop it up.
For example, perhaps it's blocked on I/O (network or disk bandwidth), or it's stalled reading from a pipeline that has failed. You might also want to look into process control (e.g. the renice command) to manage its CPU allocation a little better.
I'm playing around with some measurement instruments through PyVisa (the Python wrapper for VISA). Specifically, I need to read the measurement value from four instruments, similar to this:
current1 = instrument1.ask("READ?")
current2 = instrument2.ask("READ?")
current3 = instrument3.ask("READ?")
current4 = instrument4.ask("READ?")
For my application, speed is a must. Individually, I can get between 50 and 200 measurements per second from the four instruments, but unfortunately my current code evaluates the four instruments serially.
From what I've read, there are some options with threading and multiprocessing in Python, but it's not obvious which is the best, and fastest, option for me.
In the best-case scenario I would be spawning ~4x50 threads per second, so overhead is a bit of a concern.
The task is in no way CPU intensive, it is simply a matter of waiting for the readout from the instrument.
Any advice on what the proper course of action might be?
Many thanks in advance!
Try using the multiprocessing module in Python. Because of the way the Python interpreter is written, the threading module actually slows down your application when using multiple threads on a computer with multiple processors. This is due to something called the Global Interpreter Lock (GIL), which only allows one Python thread per process to run at a time. This makes things like memory management and concurrency easier for the interpreter.
I would suggest creating a process for each instrument (a producer process) and sending the results to a shared queue that gets processed by a single consumer process.
I would also limit the number of processes you create to the number of processors on your computer; creating too many processes introduces a lot of overhead.
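A minimal sketch of that producer/consumer layout with multiprocessing (the instrument setup and the READ? call are placeholders, since I can't exercise PyVisa here):

from multiprocessing import Process, Queue

def producer(name, queue):
    # Placeholder for the real PyVisa session; instrument.ask("READ?")
    # would go where the dummy 0.0 reading is.
    for i in range(100):
        reading = 0.0
        queue.put((name, i, reading))
    queue.put((name, None, None))                 # sentinel: this producer is done

def consumer(queue, n_producers):
    finished = 0
    results = []
    while finished < n_producers:
        name, seq, reading = queue.get()
        if seq is None:
            finished += 1
        else:
            results.append((name, seq, reading))  # store / process the measurement
    return results

if __name__ == '__main__':
    q = Queue()
    producers = [Process(target=producer, args=("instr%d" % k, q)) for k in range(4)]
    for p in producers:
        p.start()
    data = consumer(q, len(producers))            # consume in the main process
    for p in producers:
        p.join()
    print(len(data), "readings collected")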
If you really feel like you need the 200 threads you suggested in your post, I would program this in Java or C++.
I'm not sure if this fits your requirements, and I'm not too familiar with pyVisa, but have you looked into PyPy's stackless examples? The speed increase is pretty impressive, and you're able to spawn an almost unlimited number of 'threadlets' without much hindrance to your system or program.
PyPy is somewhat limited in terms of which packages it supports, but it looks like pyVisa isn't one of the unsupported ones.
https://pypi.python.org/pypi/PyVISA
I'm working on simulating a mesh network with a large number of nodes. The nodes pass data between different master nodes throughout the network.
Each master comes alive once a second to receive the information, but the slave nodes don't know when the master is up, so when they have information to send, they try every 5 ms for 1 second to make sure they can reach the master.
Running this on a regular computer with 1600 nodes results in 1600 threads, and the performance is extremely bad.
What is a good approach to handling the threading so each node acts as if it is running on its own thread?
In case it matters, I'm building the simulation in python 2.7, but I'm open to changing to something else if that makes sense.
For one, are you really using regular, default Python threads available in the default Python 2.7 interpreter (CPython), and is all of your code in Python? If so, you are probably not actually using multiple CPU cores because of the global interpreter lock CPython has (see https://wiki.python.org/moin/GlobalInterpreterLock). You could maybe try running your code under Jython, just to check if performance will be better.
You should probably rethink your application architecture and switch to manually scheduling events instead of using threads, or maybe try using something like greenlets (https://stackoverflow.com/a/15596277/1488821), but that would probably mean less precise timings because of lack of parallelism.
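A bare-bones sketch of the manually-scheduled-events idea: one priority queue of timestamped callbacks replaces the 1600 threads, and simulated time only advances when an event fires (the node logic here is only a placeholder):

import heapq

events = []      # heap of (simulated_time, seq, callback); seq breaks ties
seq = [0]

def schedule(t, callback):
    heapq.heappush(events, (t, seq[0], callback))
    seq[0] += 1

def run(until):
    while events and events[0][0] <= until:
        t, _, callback = heapq.heappop(events)
        callback(t)

def make_slave(node_id, start):
    deadline = start + 1.0                # retry window of 1 second
    def attempt(t):
        # ... check whether a master is awake, hand over the data if so ...
        if t + 0.005 < deadline:          # otherwise retry in 5 ms
            schedule(t + 0.005, attempt)
    return attempt

for node in range(1600):
    schedule(0.0, make_slave(node, 0.0))
run(until=1.0)

This gives up real parallelism, as the answer notes, but 1600 nodes retrying every 5 ms is only a few hundred thousand events per simulated second, which a single thread handles easily.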
To me, 1600 threads sounds like a lot, but not excessive given that it's a simulation. For a production application, though, a design like this would probably not hold up.
A standard machine should have no trouble handling 1600 threads. As for the OS side, this article could provide you with some insights.
When it comes to your code, a Python script is not a native application but an interpreted one, and as such it requires more CPU resources to execute.
I suggest you try implementing the simulation in C or C++ instead, which will produce a native application that should execute more efficiently.
Do not use threading for that. If you are sticking with Python, let the nodes perform their actions one by one. If the performance you get that way is OK, you will not have to use C/C++, and if the actions each node performs are simple, it may well work. In any case, there is no reason to use threads here at all: Python threads are useful mostly for keeping blocking I/O from blocking your whole program, not for utilizing multiple CPU cores.
If you really want parallel processing, and to write your nodes as if they were truly separate and communicating only through messages, you could use Erlang (http://www.erlang.org/). It is a functional language very well suited to running parallel processes and having them exchange messages. Erlang processes do not map to OS threads, and you can create thousands of them. However, Erlang is a purely functional language and may seem extremely strange if you have never used one. It is also not very fast, so, like Python, it is unlikely to handle 1600 actions every 5 ms unless the actions are quite simple.
Finally, if you do not get the desired performance with Python or Erlang, you can move to C or C++. However, still do not use 1600 threads: using threads to gain performance is reasonable only if the number of threads does not dramatically exceed the number of CPU cores. A reactor pattern (with several reactor threads) may be what you need in that case (http://en.wikipedia.org/wiki/Reactor_pattern). There is an excellent implementation of the reactor pattern in the boost.asio library; it is explained here: http://www.gamedev.net/blog/950/entry-2249317-a-guide-to-getting-started-with-boostasio/
Some random thoughts here:
I did rather well with several hundred threads working like this in Java; it can be done with the right language. (But I haven't tried this in Python.)
In any language, you could run the master node code in one thread; just have it loop continuously, running the code for each master in each cycle. You'll lose the benefits of multiple cores that way, though. On the other hand, you'll lose the problems of multithreading, too. (You could have, say, 4 such threads, utilizing the cores but getting the multithreading headaches back. It'll keep the thread-overhead down, too, but then there's blocking...)
One big problem I had was threads blocking each other. Enabling 100 threads to call the same method on the same object at the same time without waiting for each other requires a bit of thought and even research. I found my multithreading program at first often used only 25% of a 4-core CPU even when running flat out. This might be one reason you're running slow.
Don't have your slave nodes repeatedly resend data. The master nodes should come alive in response to incoming data, or there should be some way of storing it until they do come alive, or some combination of the two.
It does pay to have more threads than cores. Once you have two threads, they can block each other (and will, if they share any data). If you have code to run that won't block, you want it in its own thread so it isn't stuck waiting for blocking code to finish. I found that once I had a few threads, they started to multiply like crazy, hence my hundreds-of-threads program. Even when 100 threads are blocked at one spot despite all my brilliance, there are plenty of other threads to keep the cores busy!
I have just written a script that is intended to be run 24/7 to update some files. However, if it takes 3 minutes to update one file, then it would take 300 minutes to update 100 files.
Is it possible to run n instances of the script to manage n separate files to speed up the turnaround time?
Yes, it is possible. Use the multiprocessing module to start several concurrent processes. This has the advantage that you do not run into problems with the Global Interpreter Lock and threads, as explained in the manual page. The manual page includes all the examples you will need to make your script execute in parallel. Of course, this works best if the processes do not have to interact, which your example suggests is the case.
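For instance, something along these lines (a sketch only; update_file and the file list stand in for whatever your script actually does):

from multiprocessing import Pool

def update_file(path):
    # placeholder for the roughly 3-minute update of a single file
    return path

if __name__ == '__main__':
    files = ["file%03d.dat" % i for i in range(100)]   # hypothetical file list
    pool = Pool(processes=4)           # roughly the number of CPU cores
    pool.map(update_file, files)       # farms the files out to the worker processes
    pool.close()
    pool.join()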
I suggest you first find out if there is any way to reduce the 3 minutes in a single thread.
The method I use to discover speedup opportunities is demonstrated here.
That will also tell you if you are purely I/O bound.
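One simple way to get a first picture of where the time goes (not necessarily the method linked above) is the standard-library profiler:

import cProfile
import pstats

def update_one_file():
    pass   # placeholder for the 3-minutes-per-file work

cProfile.run('update_one_file()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)

If most of the cumulative time ends up inside read/write or other I/O calls rather than in your own code, that is a strong hint you are I/O bound.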
If you are completely I/O bound, and all files are on a single disk, parallelism won't help.
In that case, possibly storing the files on a solid-state drive would help.
On the other hand, if you are CPU bound, parallelism will help, as #hochl said.
Regardless, find the speedup opportunities and fix them.
I've never seen any good-size program that didn't have one or several of them.
That will give you one speedup factor, and parallelism will give you another, and the total speedup will be the product of those two factors.