multiprocessing.pool.Pool/ThreadPool can create workers lazily?

I'd like to declare my ThreadPool globally in my program.
However, multiprocessing.pool.ThreadPool creates its worker threads greedily, even if no jobs are ever submitted to the pool. This causes various problems even when I never end up using the threads (e.g. PyCharm is utterly unusable for me when I set breakpoints in a multi-threaded program, and stray zombie processes are left behind if the process is not shut down cleanly).
Is there a way to ask multiprocessing.pool.ThreadPool to create the threads lazily?
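One possible workaround, sketched here under the assumption that only a few pool methods are needed, is to wrap the pool in a small class that builds the real ThreadPool on first use. The names LazyThreadPool, started and shutdown are hypothetical and not part of multiprocessing.

```python
import threading
from multiprocessing.pool import ThreadPool


class LazyThreadPool:
    """Hypothetical wrapper: the real ThreadPool is only built on first use."""

    def __init__(self, processes=None):
        self._processes = processes
        self._pool = None
        self._lock = threading.Lock()  # guards first-time creation

    @property
    def started(self):
        return self._pool is not None

    def _get_pool(self):
        with self._lock:
            if self._pool is None:
                self._pool = ThreadPool(self._processes)
            return self._pool

    def map(self, func, iterable):
        return self._get_pool().map(func, iterable)

    def apply_async(self, func, args=(), kwds=None):
        return self._get_pool().apply_async(func, args, kwds or {})

    def shutdown(self):
        # Close and join the underlying pool if it was ever created.
        with self._lock:
            pool, self._pool = self._pool, None
        if pool is not None:
            pool.close()
            pool.join()


# Safe to declare globally: no worker threads exist until the first map/apply_async.
pool = LazyThreadPool(4)
```

Declaring the wrapper at module level costs nothing until the first task is submitted, which is the behaviour the question asks for.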

Related

Multiprocessing in python vs number of cores

If I run a Python script in which I declare 6 processes using multiprocessing, but I only have 4 CPU cores, what happens to the additional 2 processes that cannot find a dedicated CPU core?
How are they executed?
If the two additional processes run as separate threads on the existing cores, will the GIL not stop their execution?
#Edit 1 - 21st Jan 2021
I had mixed up threads and processes in the question I asked. Now that I have better clarity on the concept, I would rephrase question 2 as follows (for future reference):
If the two additional processes run in parallel with two other processes on the existing cores, will the GIL not stop their execution?
Ans: The GIL does NOT affect processes. It allows only one thread per interpreter to run at a time; there is no such restriction across processes. The system scheduler manages how the additional two processes run on the existing cores.

First, you are mixing up threads and processes: in Python, only threads, not processes, have to share a lock on their interpreter.
If you are using the multiprocessing library, then you are using Python processes, each of which has its own interpreter.
When you use Python processes, their execution is managed by your operating system's scheduler, in the same manner as every other process on your computer.
If you have more runnable processes than CPU cores, the extra processes wait in the background to be scheduled.
This usually happens when another process terminates or waits on I/O, or periodically on clock interrupts.
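As a rough illustration of the point above, the following sketch starts 6 worker processes regardless of how many cores the machine has. All of them finish, because the OS scheduler time-slices them over the available cores. The names busy_sum, worker and run are hypothetical helpers chosen for this example.

```python
import multiprocessing as mp
import os


def busy_sum(n):
    """CPU-bound toy task: sum of squares below n."""
    return sum(i * i for i in range(n))


def worker(n, results):
    # Each worker is a full OS process with its own interpreter and its own GIL.
    results.put((os.getpid(), busy_sum(n)))


def run(num_procs=6, n=200_000):
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(n, results)) for _ in range(num_procs)]
    for p in procs:
        p.start()
    out = [results.get() for _ in procs]  # drain the queue before joining
    for p in procs:
        p.join()
    return out


if __name__ == "__main__":
    print("cores:", os.cpu_count())
    for pid, total in run():
        print(pid, total)
```

Each printed line shows a different process ID, even on a machine with fewer than 6 cores.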

A common rule of thumb is to cap the number of concurrent workers at one less than the core count, for example with a semaphore:
import multiprocessing
sem = multiprocessing.Semaphore(max(1, multiprocessing.cpu_count() - 1))
# Detects the number of cores and creates a semaphore with one fewer permit (but at least 1).
Each process carries overhead: the OS has to manage the process itself, its memory space, and any shared memory. The operating system also needs CPU time of its own, so leaving one core free is polite and often speeds up the overall job.

Multi Thread in Python run in parallel

I have a question about multithreading in Python.
I have already tried both multithreading and multiprocessing in Python.
What I found is this:
With multithreading, I get duplicate results when running in parallel. After some research, I found that threads can update the same variable at the same time (a race condition).
Meanwhile, with multiprocessing, everything goes smoothly, without the problems I see with multithreading.
The question:
Can I use multithreading but have it behave like multiprocessing? I need to migrate more than 2 million records, and I need to run that function as asynchronously as possible (my laptop only has 4 cores), which is why I need to use multithreading.
Can someone explain this to me?

In multithreading, all threads share the memory space of the parent process that spawned them. In multiprocessing, each process has its own memory space.
So in multithreading you need a lock (semaphore/mutex), e.g. threading.Lock(), to prevent race conditions. That is not to say multiprocessing cannot have race conditions: it can if you explicitly share the same object rather than a copy of it. But by default each process gets its own copy.
Multithreading is also limited by Python's GIL (Global Interpreter Lock), which ensures that only one thread executes Python code at any given time. So if you run a computationally intensive task on two threads, it doesn't really get faster, as only one thread can be active at a time.
Multiprocessing overcomes this easily, as the work runs in multiple processes, each handled by the OS scheduler and able to run in parallel.
General rule of thumb:
if your task is computationally intensive, use processes
if your task is I/O intensive, use threads
if your threads need concurrent access to the same variable/object, etc., you need to use a lock.
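A minimal sketch of the locking advice above: four threads increment one shared counter, and a threading.Lock makes each read-modify-write step exclusive. The names counter, counter_lock and add_many are illustrative only.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(times):
    global counter
    for _ in range(times):
        # "counter += 1" is a read-modify-write, so it is guarded by the lock.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Without the lock, updates from different threads can overwrite each other, which is one way duplicate or lost results appear.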

How are threads different from process in terms of how they are executed on hardware level?

I was wondering how threads are executed at the hardware level. A process runs on a single processing core, and switching between processes requires a context switch on the processor and the MMU. How do threads switch? Secondly, when we create/spawn a new thread, will the processor see it as a new process and schedule it as one?
Also, when should one use threads and when a new process?
I know I probably sound dumb right now; that's because I have massive gaps in my knowledge that I would like to fill. Thanks in advance for taking the time to explain things to me. :)

There are a few different approaches to concurrency. The threading module creates threads within the same Python process and switches between them, which means they are not really running Python code at the same time. Something similar happens with the asyncio module, except that it runs tasks on a single thread and lets you control the points at which a task can be switched out (at each await).
Then there is the multiprocessing module, which creates a separate Python process for each worker. The workers do not share memory by default, but the processes can run on different CPU cores and can therefore provide a performance improvement for CPU-bound tasks.
Regarding when to use which, a good rule of thumb would be:
For I/O-bound problems, use threading or async I/O. You are waiting on responses from something external, like a database or browser, and this waiting time can instead be filled by another thread running its task.
For CPU-bound problems, use multiprocessing. This can run multiple Python processes on separate cores at the same time.
Disclaimer: Threading is not always a solution; you should first determine whether it is necessary and only then look to implement it.
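The I/O-bound case from the rule of thumb can be sketched as follows. This is only an illustration: fake_request is a hypothetical stand-in that sleeps instead of contacting a real database or server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i):
    time.sleep(0.05)  # stands in for waiting on a database or HTTP response
    return i * 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # While one thread sleeps (waits on I/O), the others can make progress.
    results = list(executor.map(fake_request, range(8)))
elapsed = time.perf_counter() - start

print(results, f"{elapsed:.2f}s")
```

Run one after another, the eight calls would take about 0.4 seconds. With eight threads the waits overlap, so the total is typically close to a single call's delay.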

Think of it this way: "a thread is part of a process."
A "process" owns resources such as memory, open file-handles and network ports, and so on. All of these resources are then available to every "thread" which the process owns. (By definition, every "process" always contains at least one ("main") "thread.")
CPUs and cores, then, execute these "threads," in the context of the "process" which they belong to.
On a multi-CPU/multi-core system, it is therefore possible that more than one thread belonging to a particular process really is executing in parallel. Although you can never be sure.
Also: in the context of an interpreter-based programming language system like Python, the actual situation is a little bit more complicated "behind the scenes," because the Python interpreter context does exist and will be seen by all of the Python threads. This does add a slight amount of additional overhead so that it all "just works."

On the OS level, threads are units of execution that share the same resources (memory, file descriptors, etc.). Groups of threads that belong to different processes are isolated from each other and can't access resources across the process boundary. You can think of a plain process as a single thread, not unlike any other thread.
OS threads are scheduled like you would expect: if there are several cores, they can run in parallel; if there are more threads / processes ready to run than there are cores, some threads get preempted after some time, paused, and another thread has a chance to run on that core.
In Python, though, the difference between threads (threading module) and processes (multiprocessing module) is drastic.
Python runs in a VM. Threads run within that VM. Objects within the VM are reference-counted and are unsafe to modify concurrently. So OS thread scheduling, which can preempt one thread in the middle of a VM instruction modifying an object and give control to another thread that accesses the same object, would result in corruption.
This is why the global interpreter lock, aka GIL, exists. It basically prevents any computational parallelism between Python "threads": only one thread can execute Python code at a time, no matter how many CPU cores you have. Python threads are mainly good for waiting on I/O.
In contrast, multiprocessing runs a parallel VM (Python interpreter) and shares select pieces of data with it in a safe way (by copying, or using shared memory). Such processes can run in parallel and utilize multiple CPU cores.
In short: Python threads are OS threads underneath, but the GIL keeps them from running Python code in parallel.

How do I stop a thread, if it becomes inactive after n seconds

I'm setting up a multithreaded python server, and I want to remove threads that have been inactive for n seconds.

The approach I can think of here is to have a supervising daemon that handles such threads. As far as possible, the threads should be spawned by that daemon, which makes it easier to track them and to manage each thread's inactivity timer.
If that is not possible (a separate program spawned the thread), you need an established naming (or tracking) convention that lets you determine which threads fall under your program's scope, so that the daemon can terminate them accordingly.
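Python's threading module offers no call to kill a thread from outside, so one cooperative variant of this idea is to let each worker end itself once it has been idle for n seconds. The sketch below assumes the worker receives its work through a queue. The names worker, handled and idle_timeout are hypothetical.

```python
import queue
import threading


def worker(tasks, handled, idle_timeout):
    """Process items until none arrives for idle_timeout seconds, then exit."""
    while True:
        try:
            item = tasks.get(timeout=idle_timeout)
        except queue.Empty:
            return  # idle for too long: let the thread end on its own
        handled.append(item)
        tasks.task_done()


tasks = queue.Queue()
handled = []
t = threading.Thread(target=worker, args=(tasks, handled, 0.2), daemon=True)
t.start()

tasks.put("request-1")
t.join(timeout=2)  # returns once the worker has been idle for about 0.2 s
print(handled, t.is_alive())
```

A supervising thread could keep a list of such workers and simply drop the ones whose is_alive() has become False.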

Closing the multiprocessing after each parallel job

I am using Python's multiprocessing library together with emcee, which also uses multiple threads to implement MCMC. The problem is that even after I close the pool, Python still seems to use the processors and slow down the cores, and I have no idea what the efficient way is to release the cores once the job is done. Could anybody give me an idea of what I should do?
Update:
My code has already been posted here.

Closing a Pool doesn't stop it from doing work; it just prevents new work items from being added to it:
close()
Prevents any more tasks from being submitted to the pool. Once all the
tasks have been completed the worker processes will exit.
So if you have a lot of queued tasks, closing the Pool won't make any difference in resource usage: all the workers will keep consuming those queued tasks until they're gone. If you want to abort all the work items immediately, you have to use pool.terminate():
terminate()
Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected terminate() will be called immediately.
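The close() behaviour quoted above can be observed with a small sketch. It uses ThreadPool, which exposes the same close()/join()/terminate() methods as Pool, purely to keep the example light. slow_square is a hypothetical task.

```python
import time
from multiprocessing.pool import ThreadPool


def slow_square(x):
    time.sleep(0.01)
    return x * x


pool = ThreadPool(2)
pending = [pool.apply_async(slow_square, (i,)) for i in range(10)]

pool.close()  # no new tasks are accepted; already queued ones still run
try:
    pool.apply_async(slow_square, (99,))
    rejected = False
except ValueError:  # raised because the pool is no longer running
    rejected = True

pool.join()  # wait for the workers to drain the queue and exit
results = [p.get() for p in pending]
print(rejected, results)
```

All ten queued tasks complete even though close() was called right after submitting them. Calling pool.terminate() instead of pool.close() is the way to give up on outstanding work.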
