I am writing a Python program using mpi4py, from which I import MPI. Then I set up the global communicator MPI.COMM_WORLD and store it in the variable comm.
I am running this code with n > 1 processes, and at some point they all enter a for loop (all ranks have the same number of iterations to go through).
Inside the for loop I have a "comm.reduce(...)" call.
This seems to work for a small number of processes, but as the problem size increases (with 64 processes, say) my program hangs.
So I am wondering if this has to do with the reduce(...) call. I know that this call needs all processes (that is, say we run 2 processes in total: if one enters the loop but the other doesn't for whatever reason, the program will hang, because the reduce(...) call waits for both).
My question is:
Is the reduce call a "synchronization" task, i.e., does it work like a "comm.Barrier()" call?
And, if possible, more generally: what are the synchronizing calls (if there are any besides Barrier)?
Yes, the standard MPI reduce call is blocking: no process can proceed past it until all of them have contributed their value to the root. Other blocking calls are Allgather, Allreduce, Alltoall, Barrier, Bsend, Gather, Recv, Reduce, Scatter, etc.
Many of these have non-blocking equivalents, whose names are prefixed with an I (e.g. Isend), but these aren't implemented across the board in mpi4py.
See mpi: blocking vs non-blocking for more info on that.
Not sure about your hang. It may be an issue of processor oversubscription: running a 64-process job on a 4-core desktop can bog the machine down.
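To make the failure mode concrete, here is a minimal sketch of the pattern you describe (my own example, not your code), launched with something like mpiexec -n 4 python script.py. If even one rank skips an iteration (an early break, or an exception on that rank alone), every other rank blocks inside reduce forever, which looks exactly like a hang:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

total = 0
for i in range(10):
    local = rank + i                                 # some per-rank value
    # Blocking collective: every rank in comm must reach this call
    # in the same iteration, or the program hangs.
    result = comm.reduce(local, op=MPI.SUM, root=0)
    if rank == 0:
        total += result

if rank == 0:
    print('grand total:', total)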
Related
I was wondering how threads are executed at the hardware level. A process runs on a single processing core, and the processor and the MMU perform a context switch in order to switch between processes. How do threads switch? Secondly, when we create/spawn a new thread, will the processor see it as it would a new process, and will it be scheduled as a process would be?
Also, when should one use threads and when a new process?
I know I probably am sounding dumb right now; that's because I have massive gaps in my knowledge that I would like to fill. Thanks in advance for taking the time to explain things to me. :)
There are a few different methods for concurrency. The threading module creates threads within the same Python process and switches between them, which means they're not really running at the same time. The asyncio module is similar in that everything stays in one process, but it has the additional feature of letting you control the points at which a task can be switched.
Then there is the multiprocessing module, which creates a separate Python process per worker. This means the workers do not share memory by default, but it also means the processes can run on different CPU cores and can therefore provide a performance improvement for CPU-bound tasks.
Regarding when to use new threads, a good rule of thumb would be:
For I/O-bound problems, use threading or async I/O. This is because you're waiting on responses from something external, like a database or browser, and this waiting time can instead be filled by another thread running its task.
For CPU-bound problems, use multiprocessing. This can run multiple Python processes on separate cores at the same time.
Disclaimer: threading is not always a solution; you should first determine whether it is necessary at all and only then look at how to implement it. (A short sketch of both approaches follows.)
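To illustrate the rule of thumb, here is a hedged sketch contrasting the two modules on a hypothetical CPU-bound function work(n); the names and sizes are illustrative only:

import threading
import multiprocessing

def work(n):
    return sum(i * i for i in range(n))

def run_threads():
    # Suited to I/O-bound work; all threads share one interpreter,
    # so CPU-bound code gains nothing here.
    threads = [threading.Thread(target=work, args=(100000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def run_processes():
    # Suited to CPU-bound work; each worker is a separate interpreter
    # that can run on its own core.
    with multiprocessing.Pool(processes=4) as pool:
        print(pool.map(work, [100000] * 4))

if __name__ == '__main__':   # guard required on platforms that spawn processes
    run_threads()
    run_processes()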
Think of it this way: "a thread is part of a process."
A "process" owns resources such as memory, open file-handles and network ports, and so on. All of these resources are then available to every "thread" which the process owns. (By definition, every "process" always contains at least one ("main") "thread.")
CPUs and cores, then, execute these "threads," in the context of the "process" which they belong to.
On a multi-CPU/multi-core system, it is therefore possible that more than one thread belonging to a particular process really is executing in parallel, although you can never be sure: the operating system decides which threads run, and when.
Also: in the context of an interpreter-based language system like Python, the actual situation is a little more complicated "behind the scenes," because the Python interpreter's own context exists and is seen by all of the Python threads. This adds a slight amount of additional overhead so that it all "just works."
On the OS level, threads are units of execution that share the same resources (memory, file descriptors, etc.). Groups of threads belonging to different processes are isolated from each other and can't access resources across the process boundary. You can think of a plain, single-threaded process as just one thread, not unlike any other thread.
OS threads are scheduled like you would expect: if there are several cores, they can run in parallel; if there are more threads / processes ready to run than there are cores, some threads get preempted after some time, paused, and another thread has a chance to run on that core.
In Python, though, the difference between threads (threading module) and processes (multiprocessing module) is drastic.
Python runs in a VM. Threads run within that VM. Objects within the VM are reference-counted, and are also unsafe to modify concurrently. So OS thread scheduling, which can preempt one thread in the middle of a VM instruction that is modifying an object and give control to another thread that accesses the same object, would result in corruption.
This is why the global interpreter lock, aka GIL, exists. It basically prevents any computational parallelism between Python "threads": only one thread can proceed at a time, no matter how many CPU cores you have. Python threads are only good for waiting for I/O.
Unlike that, multiprocessing runs a parallel VM (another Python interpreter) and shares select pieces of data with it in a safe way (by copying, or via shared memory). Such parallel processes can run in parallel and utilize multiple CPU cores.
In short: Python threads ≠ OS threads.
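As a small illustration of that "shares select pieces of data in a safe way" point, here is a hedged sketch using multiprocessing.Value as shared memory; the names are illustrative only:

import multiprocessing

def increment(counter, n):
    for _ in range(n):
        with counter.get_lock():        # the lock guards the shared integer
            counter.value += 1

if __name__ == '__main__':
    counter = multiprocessing.Value('i', 0)   # an int shared across processes
    procs = [multiprocessing.Process(target=increment, args=(counter, 1000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)   # 4000: every process saw the same memory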
I am using the threading module to run a function in the background while the rest of my script executes. The threaded function contains a for loop which waits for external 5 volt triggers, occurring every 15 ms, before continuing to the next loop iteration.
When this code is the only thing running on the PC, everything works as expected. However, when I run other necessary applications, putting strain on the CPU, the for loop in the threaded function only executes and continues to the next iteration within the 15 ms time window about 90% of the time.
The input to the threaded function is a list of ctypes pointers.
I am running the threaded function from within a class, so using multiprocessing is tricky (if that would help at all i'm not sure).
I've tried to illustrate the problem below with a skeleton of the two classes
import time
import ctypes
from ctypes import POINTER, c_ubyte, c_int, c_bool, c_uint
from threading import Thread

# placeholder imports for the external SDK pieces
import Write_transient_frames_func
import SendScriptCommands


class SlmInterface():

    def __init__(self, sdk):
        self.sdk = sdk

    def precalculate_masks(self, mask_list):
        '''Takes input mask_list, a list of numpy arrays containing phase masks;
        outputs pointers to the memory locations of the masks.
        '''
        # list of pointers to the locations of the phase mask arrays in memory
        mask_pointers = [mask.ctypes.data_as(POINTER(c_ubyte)) for mask in mask_list]
        return mask_pointers

    def load_precalculated_triggered(self, mask_pointers):
        okay = True
        print('Ready to trigger')
        for arr in mask_pointers:
            # external SDK call; blocks until the next 5 V trigger arrives
            okay = Write_transient_frames_func(self.sdk, c_int(1), arr,
                                               c_bool(1), c_bool(1), c_uint(0))
            assert okay, 'Failed to write frames to board'
        print('completed trigger sequence')


class Experiment():

    def run_experiment(self, sdk, mask_list):
        slm = SlmInterface(sdk)
        # list of ctypes pointers
        mask_pointers = slm.precalculate_masks(mask_list)
        ## the threaded function
        slm_thread = Thread(target=slm.load_precalculated_triggered,
                            args=[mask_pointers])
        slm_thread.start()
        time.sleep(0.1)
        # this function loads the 15 ms trigger sequences to the hardware
        # and begins the sequence
        self.mp_output = SendScriptCommands()
Is it possible to speed up execution of the threaded function? Would parallel processing help? Or am I fundamentally limited by my CPU?
Unfortunately, Python will likely not be able to do much better. Python has a global interpreter lock, which means that multithreading doesn't work the way it does in other languages.
You should be aware that multithreading in Python can make an application run slower. A good alternative is asyncio, because it allows cooperative multitasking of several tasks within one thread (the OS doesn't need to actually switch a thread, so there is less overhead and faster execution). If you haven't used it before it feels kind of weird at first, but it's actually really nice.
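For reference, a minimal asyncio sketch of that cooperative multitasking: control is handed over explicitly at each await, with no OS thread switch (the task names and delays are illustrative):

import asyncio

async def task(name, delay):
    await asyncio.sleep(delay)    # yields control while waiting
    print(name, 'done')

async def main():
    await asyncio.gather(task('a', 0.1), task('b', 0.1))

asyncio.run(main())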
However, your task really seems to be CPU-bound, so maybe the only option is multiprocessing in Python.
Probably Python isn't really the culprit here. The point is that with general-purpose, preemptive, multiuser operating systems you are not going to get a guarantee of running continuously enough to catch triggers every 15 ms. CPU time is allocated in quanta of generally some tens of ms, and the OS can, and will, let your thread run more or less frequently depending on the CPU load, in an effort to give each process its fair share of available CPU time.
You may increase the priority of your thread to ask for it to have precedence over the others, or, in the extreme case, change it to real-time priority to let it hog the CPU indefinitely (and potentially hang the system if stuff goes awry).
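One way to raise the priority of the whole process (my own suggestion, not part of the original answer) is the third-party psutil package, assuming it is installed:

import os
import psutil   # third-party package, assumed available

p = psutil.Process(os.getpid())
# On Unix a lower nice value means higher priority; negative values
# usually require root. On Windows pass e.g. psutil.HIGH_PRIORITY_CLASS.
p.nice(-10)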
But really, the actual solution is to handle this at a lower level, either in kernel mode or in hardware. Polling at those rates from user mode is inadvisable if you cannot miss a signal, so you should probably investigate whether your hardware/driver provides some higher-level interface, for example an interrupt (translated e.g. into unlocking some blocking call, or producing a signal, or something similar) on trigger.
Following is my multiprocessing code. regressTuple has around 2000 items, so the following code creates around 2000 parallel processes. My Dell XPS 15 laptop crashes when this is run.
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing, in minimal time? Am I not doing this correctly?
Is there an API call in Python to get the possible hardware process count?
How can I refactor the code to use an input variable to get the parallel thread count (hard-coded) and loop through threading several times till completion? In this way, after a few experiments, I would be able to get the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Here is my code:
from multiprocessing import Process

regressTuple = [(x,) for x in regressList]
processes = []

# create one Process object per work item (~2000 of them)
for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus, args=regressTuple[i]))

# starting them all at once is what overwhelms the machine
for process in processes:
    process.start()

for process in processes:
    process.join()
There are multiple things that we need to keep in mind:
The number of processes you can spawn is not limited by the number of cores on your system, but by the ulimit for your user id, which controls the total number of processes that can be launched under it.
The number of cores determines how many of those launched processes can actually be running in parallel at any one time.
Your system can crash because the target function these processes run is doing something heavy and resource-intensive that the system cannot handle when many processes run simultaneously, or because the nprocs limit on the system has been exhausted and the kernel cannot spin up new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even on a 16-core Intel Skylake machine, because creating a new process is not a lightweight task: generating the pid, allocating memory, setting up the address space, scheduling the process, context switching, and managing its entire life cycle all happen in the background, so it is a heavy operation for the kernel.
Unfortunately, I guess what you are trying to do is a CPU-bound task and hence limited by the hardware on your machine. Spawning more processes than the number of cores on your system is not going to help at all, but creating a process pool might. So basically you want a pool with as many processes as you have cores, and then pass the input to the pool. Something like this:
import multiprocessing

def target_func(data):
    # process one input item
    ...

if __name__ == '__main__':
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        res = pool.map(target_func, regressTuple)
Can't python multi processing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
I don't think it's Python's responsibility to manage the queue length. When people reach for multiprocessing they tend to want efficiency, and adding system performance tests to the run queue would be an overhead.
Is there an API call in python to get the possible hardware process count?
If there were, would it know ahead of time how much memory your task will need?
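(For completeness: the core count itself is available in the standard library; what it cannot tell you is the memory footprint of each task.)

import os

print(os.cpu_count())                      # logical CPUs the OS reports
# On Linux, the CPUs this process is actually allowed to use:
if hasattr(os, 'sched_getaffinity'):
    print(len(os.sched_getaffinity(0)))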
How can I refactor the code to use an input variable to get the parallel thread count (hard coded) and loop through threading several times till completion - In this way, after few experiments, I will be able to get the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing. (I cannot use multi-threading in my implementation)
Use a pool, or take the available system memory, divide by ~3MB and see how many tasks you can run at once.
This is probably more of a sysadmin task, balancing the bottlenecks against the queue length; but generally, if your tasks are IO-bound, then there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn into the road. The tasks will then fight with each other for the next block of IO.
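That available-memory arithmetic can be done with the third-party psutil package (my assumption, not part of the original answer; the ~3MB figure is the one suggested above):

import psutil   # third-party package, assumed installed

avail = psutil.virtual_memory().available            # bytes currently free
print(avail // (3 * 1024 * 1024), 'tasks at ~3 MB each')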
In an asynchronous program (e.g., asyncio, twisted, etc.), all system calls must be non-blocking. That means a non-blocking select (or something equivalent) needs to be executed in every iteration of the main loop. That seems more wasteful than the multi-threaded approach, where each thread can use a blocking call and sleep (without wasting CPU resources) until its socket is ready.
Does this sometimes cause asynchronous programs to be slower than their multi-threaded alternatives (despite thread switching costs), or is there some mechanism that makes this not a valid concern?
When working with select in a single-threaded program, you do not have to continuously check the results. The right way to work with it is to let it block until the relevant I/O has arrived, just as in the multi-threaded case.
However, instead of waiting on a single socket (or other I/O), the select call is given a list of relevant sockets and blocks until any of them is ready.
Once that happens, select wakes up and returns a list of the sockets (or I/Os) that are ready. It is up to the coder to handle those ready sockets in the required way, and then, if the code has nothing else to do, it might start another iteration of select.
As you can see, no polling loop is required; the program does not consume CPU resources until one or more of the watched sockets are ready. Moreover, if a few sockets became ready at almost the same time, the code wakes up once, handles all of them, and only then calls select again. Add to that the fact that the program does not require the resource overhead of several threads, and you can see why this is more effective in terms of OS resources.
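Here is a minimal sketch of that pattern using Python's standard selectors module (the port number and the echo behavior are illustrative only):

import selectors
import socket

sel = selectors.DefaultSelector()

server = socket.socket()
server.bind(('localhost', 9000))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

while True:
    # Blocks, consuming no CPU, until at least one registered socket is ready.
    for key, mask in sel.select():
        if key.fileobj is server:
            conn, _ = server.accept()          # a new client connected
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(1024)
            if data:
                key.fileobj.send(data)         # echo it back
            else:
                sel.unregister(key.fileobj)    # client closed the connection
                key.fileobj.close()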
In my question I separated the I/O handling into two categories: polling, represented by non-blocking select, and "callback", represented by blocking select. (The blocking select puts the thread to sleep, so it's not strictly speaking a callback; but conceptually it is similar to one, since it doesn't use CPU cycles until the I/O is ready. Since I don't know the precise term, I'll just use "callback".)
I assumed that the asynchronous model cannot use "callback" I/O. It now seems to me that this assumption was incorrect. While an asynchronous program should not be using non-blocking select, and it cannot strictly request a traditional callback from the OS either, it can certainly provide the OS with its main event loop and, say, a coroutine, and ask the OS to create a task in that event loop using that coroutine when an I/O socket is ready. This would not use any of the program's CPU cycles until the I/O is ready. (It might use the OS kernel's CPU cycles if the kernel polls rather than uses interrupts for I/O, but that would be the case even with a multi-threaded program.)
Of course, this requires that the OS support the asynchronous framework used by the program, and it probably doesn't. But even then, it seems quite straightforward to add a middle layer that uses a single separate thread and blocking select to talk to the OS and, whenever I/O is ready, creates a task in the program's main event loop. If this layer is included in the interpreter, the program looks perfectly asynchronous. If this layer is added as a library, the program is largely asynchronous, apart from one simple additional thread that converts synchronous I/O into asynchronous I/O.
I have no idea whether any of this is done in Python, but it seems plausible conceptually.
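For what it's worth, asyncio's event loop does essentially this: loop.add_reader schedules a callback for when a file descriptor becomes readable, and under the hood the default selector-based loop blocks in a selector rather than polling. A small sketch with an illustrative socket pair (note that the proactor loop, the default on Windows, does not support add_reader):

import asyncio
import socket

async def main():
    loop = asyncio.get_running_loop()
    r, w = socket.socketpair()          # illustrative pair of sockets
    r.setblocking(False)
    ready = asyncio.Event()

    def on_readable():
        print('received:', r.recv(1024))
        loop.remove_reader(r)
        ready.set()

    loop.add_reader(r, on_readable)     # no polling in user code
    w.send(b'ping')
    await ready.wait()
    r.close()
    w.close()

asyncio.run(main())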
I have a script which runs quite a lot of concurrent threads (at least 200). Every thread does some quite complex evaluations, which can take an unpredictably long time. The evaluation method is implemented in C and I can't change it. I want to limit the method execution time for every thread. Please advise.
From what I understand of your problem, it might be a good case for using multiprocessing instead of multithreading. Multiprocessing will allow you to make use of all the available resources on the system - and then some, if you're not careful.
Threads don't actually run in parallel in CPython, so unless you're doing a lot of waiting for I/O or something like that, it would make more sense to call the method from a separate process. You could use the Python multiprocessing library to call it from a Python script, or you could use a wrapper written in C together with some form of interprocess communication. The second option avoids the overhead of launching another Python instance just to run some C code.
You could call time.sleep (or perform other tasks and check the system clock for elapsed time) and then check for results after the desired interval, letting any processes that haven't finished continue running while you make use of the results you do have. Or, if you don't care about the stragglers at that point, you can send a signal to kill the process.
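A minimal sketch of that timeout pattern, with a stand-in evaluate function in place of your C method (the names and timings are illustrative):

import multiprocessing
import time

def evaluate(seconds):
    time.sleep(seconds)     # stand-in for the long C evaluation
    print('finished after', seconds, 's')

if __name__ == '__main__':
    p = multiprocessing.Process(target=evaluate, args=(10,))
    p.start()
    p.join(timeout=2)       # wait up to 2 seconds for it to finish
    if p.is_alive():
        p.terminate()       # kill the evaluation that exceeded its budget
        p.join()
        print('evaluation timed out')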