Split a ThreadPoolExecutor across multiple CPU cores in Python

Hey, I am pretty new to this community and I was wondering if this is possible.
For example, I have a ThreadPoolExecutor like this:
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Semaphore

lock = Semaphore(1)
threads = 100  # worker count; not defined in the original snippet

# Just a pseudocode example
profileTasks = ["TEST1", "TEST2", "TEST3", "TEST4"]  # ... and so on

def runTask(index, profile):
    lock.acquire()
    print(f"{index} with {profile}")
    lock.release()

runningLoop = True
while runningLoop:
    # Launch anyway with the pool executor
    tasks = []
    with ThreadPoolExecutor(max_workers=threads) as executor:
        for index, profile in enumerate(profileTasks):
            tasks.append(
                executor.submit(runTask, index, profile)
            )
    runningLoop = False
When I launch more than 100 tasks, the threads take a long time to start from the executor. I want to split the workload: if I run 1000 tasks on a CPU with, say, 8 cores, I want to split the 1000 tasks across 8 processes, each of which runs its own thread pool executor.
I hope you understand what I mean. Threading in Python is in general not very smart, because it only uses one CPU core.
I tried counting the CPU cores and executing the work in a MultiProcessExecutor, but it was a complete failure and froze my CPU.

The most widely used Python implementation (the one from python.org, commonly referred to as CPython because it is written in C) enforces that only one thread at a time can be executing Python bytecode.
So using threads to speed up computationally intensive applications does not work with this implementation.
If you want to use multiple cores for running the same job, you have to use e.g. a multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor. The latter is built on top of multiprocessing.
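As a rough illustration of what that could look like (a minimal sketch, not your exact workload: cpu_bound_task and the numbers are made up), ProcessPoolExecutor spreads the submitted calls over separate processes, one per core by default:
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

def cpu_bound_task(n):
    # Stand-in for real CPU-heavy work
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # required for the spawn start method (Windows/macOS)
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        futures = [executor.submit(cpu_bound_task, 10_000_000) for _ in range(16)]
        for future in as_completed(futures):
            print(future.result())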
Based on your comments, if your task is to send HTTP requests or other network traffic, then a ThreadPoolExecutor might be more appropriate.
That is because I/O (be it disk or network) in CPython does not suffer from the aforementioned restriction: in technical terms, the Global Interpreter Lock in CPython is released during I/O, giving other threads time to run.
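For a network-bound job, a thread-based version might look something like this (a sketch only; the URL list, worker count and fetch function are invented for illustration, using only the standard library):
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

urls = ["https://example.com/"] * 20  # placeholder URLs

def fetch(url):
    # urlopen blocks on the network, during which the GIL is released
    with urlopen(url, timeout=10) as resp:
        return url, resp.status

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(fetch, u) for u in urls]
    for future in as_completed(futures):
        print(future.result())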
However, network I/O has its own problems.
If you look at performance:
- The CPU running instructions and data from its cache is the fastest. (In order to keep this simple, I will not distinguish between bandwidth and latency here.)
- If the CPU has to get data or instructions from memory, that is much slower than from the cache.
- Disk I/O (especially HDD) is much slower than memory.
- Network I/O is generally much slower than disk.
For example, when writing data (from /dev/zero to a file on disk) I've observed speeds of ≈200 MB/s on a SATA 3 hard disk using ZFS.
When using netcat to blast files from one computer to another over a gigabit point-to-point Ethernet link with no other traffic (probably the best possible case for consumer equipment at this time), I get a maximum of ≈120 MB/s. When e.g. downloading a video from the internet, I might get in the order of ≈12 MB/s max.
If you want to run 1000 simultaneous network queries, a couple of things can happen:
- You could saturate your internet connection. Instead of the tasks competing for CPU time, they are now competing for network bandwidth. This improves neither throughput nor latency.
- Your ISP might restrict throughput.
- If all the queries go to the same domain, you might trigger a denial-of-service warning and that domain's firewall will block or restrict your connections.
In short: running 1000 queries at the same time from a single IP address is probably not a good idea.

Related

Python: How many cores are used by my Python program with five processes?

I have a Python program consisting of 5 processes outside of the main process. Now I'm looking to get an AWS server or something similar on which I can run the script. But how can I find out how many vCPU cores are used by the script / how many are needed? I have looked at:
import multiprocessing
multiprocessing.cpu_count()
But it seems that it just returns the CPU count that's on the system. I just need to know how many vCPU cores the script uses.
Thanks for your time.
EDIT:
Just for some more information. The Processes are running indefinitely.
On Linux you can use the "top" command at the command line to monitor the real-time activity of all threads of a process id:
top -H -p <process id>
Answer to this post probably lies in the following question:
Multiprocessing : More processes than cpu.count
In short, you have probably hundreds of processes running, but that doesn't mean you will use hundreds of cores. It all depends on utilization, and the workload of the processes.
You can also get some additional info from the psutil module
import psutil
print(psutil.cpu_percent())
print(psutil.cpu_stats())
print(psutil.cpu_freq())
or use os together with psutil to get the current CPU usage in Python:
import os
import psutil
load1, load5, load15 = psutil.getloadavg()      # 1-, 5- and 15-minute load averages
CPU_use = (load15 / os.cpu_count()) * 100       # rough % utilization over the last 15 minutes
print(CPU_use)
Credit: DelftStack
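If you want per-process rather than system-wide numbers, psutil can also report usage for a specific pid (a small sketch; the pid value is just an example for one of your worker processes):
import psutil

p = psutil.Process(1234)           # pid of one of your processes (example value)
print(p.cpu_percent(interval=1))   # % of a single core used over a 1-second window
print(p.num_threads())             # number of threads in that process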
Edit
There might be some information for you in the following medium article. Maybe there are some tools for CPU usage too.
https://medium.com/survata-engineering-blog/monitoring-memory-usage-of-a-running-python-program-49f027e3d1ba
Edit 2
A good guideline for how many processes to start depends on the number of hardware threads available. It's basically just thread count + 1; this ensures your processor doesn't just sit around and wait. This works best when you are I/O-bound, think of waiting for files from disk: once a process is waiting, it is blocked, so you have 8 others to take over. The one extra is redundancy: in case all 8 are blocked, the one that's left can take over right away. You can, however, increase or decrease this number as you see fit.
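A minimal sketch of that rule of thumb (the work function is a placeholder; os.cpu_count() reports logical CPUs, i.e. hardware threads):
import os
from concurrent.futures import ProcessPoolExecutor

workers = (os.cpu_count() or 1) + 1   # hardware threads + 1 spare, per the guideline above

def work(item):
    return item  # placeholder for the real task

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=workers) as executor:
        print(list(executor.map(work, range(10))))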
Your question uses some general terms and leaves much unspecified so answers must be general.
It is assumed you are managing the processes using either Process directly or ProcessPoolExecutor.
In some cases a vCPU is a logical processor, but per the following link there are services offering configurations with fractional vCPUs, such as those in shared environments...
What is vCPU in AWS
You mention/ask...
... Now I'm looking to get an AWS server or something similar on which I can run the script. ...
... But how can I find out how many vCPU cores are used by the script/how many are needed? ...
You state AWS or something like it. The answer would depend on what your subprocesses do and how much of a vCPU, or fraction of a vCPU, each subprocess needs. Generally, a vCPU is analogous to a logical processor upon which a thread can execute. A fractional vCPU gives you some limited share of a full vCPU.
The meaning of one or more vCPUs (or fractional vCPUs thereto) to your subprocesses really depends on those subprocesses, what they do. If one subprocess is sitting waiting on I/O most of the time, you hardly need a dedicated vCPU for it.
I recommend starting with some minimal least expensive configuration and see how it works with your app's expected workload. If you are not happy, increase the configuration as needed.
If it helps...
I usually break things into subprocesses when I need simultaneous execution that avoids Python's GIL limitations. I generally use a single active thread per subprocess, where any other threads in the same subprocess are usually waiting on I/O or otherwise do not compete with the primary active thread of the subprocess. Of course, a subprocess could be dedicated to I/O if you want to separate that from the other threads, which you place in other subprocesses.
Since we do not know your app's purpose, architecture and many other factors, it's hard to say more than the generalities above.
Your computer has hundreds if not thousands of processes running at any given point. How does it handle all of those if it only has 5 cores? The thing is, each core takes a process for a certain amount of time or until it has nothing left to do inside that process.
For example, if I create a script that calculates the square root of all numbers from 1 to say a billion, you will see that a single core will hit max usage, then a split second later another core hits max while the first drops to normal and so on until the calculation is done.
Or, if the process is waiting on I/O, the core has nothing to do, so it drops the process and moves on to another one; when the I/O operation is done, the core can pick the process back up and get back to work.
Your multiprocessing Python code may run on a single core or on 100 cores; you can't really do much about it. However, on Windows you can set the affinity of a process, which gives the process access to certain cores only. So, when the processes start, you can go to each one and set its affinity to, say, core 1, or pin each one to a separate core. Not sure how you do that on Linux though.
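For what it's worth, affinity can also be set programmatically. A small sketch (the core numbers are arbitrary): psutil's cpu_affinity() works on both Windows and Linux, and os.sched_setaffinity() is the Linux-only equivalent in the standard library:
import psutil

p = psutil.Process()      # current process; pass a pid to pin another one
print(p.cpu_affinity())   # cores the process is currently allowed to run on
p.cpu_affinity([0, 1])    # restrict it to cores 0 and 1

# Linux only, without psutil:
# import os
# os.sched_setaffinity(0, {0, 1})   # 0 means "the calling process"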
In conclusion, if you want a short and direct answer, I think we can say as many cores as it has access to. If you give them one core or 200 cores, they will still work. However, performance may degrade if the processes are CPU intensive, so I recommend starting with one core on AWS, check performance, and upgrade if needed.
I'll try to do my own summary about "I just need to know how many vCPU cores the script uses".
There is no way to answer that properly other than running your app and monitoring its resource usage. Assuming your Python processes do not spawn subprocesses of their own (which could even be multithreaded applications), all we can say is that your app won't utilize more than 6 cores (the total number of processes). There are plenty of ways for a program to under-utilize CPU cores, like waiting for I/O (disk or network) or interprocess synchronization (shared resources). So to get any kind of understanding of CPU utilization, you really need to measure the actual performance (e.g., with the htop utility on Linux or macOS) and investigate the causes of underperformance (if any).
Hope it helps.

Python multiprocessing - reassigning jobs dynamically from pool - without async?

So I have a batch of 1000 tasks that I assign using parmap / the Python multiprocessing module to 8 cores (dual-Xeon machine, 16 physical cores). Currently this runs synchronously.
The issue is that usually 1 of the cores lags well behind the other cores and still has several jobs/tasks to complete after all the other cores have finished their work. This may be related to core speed (older computer), but more likely it is due to some of the tasks being more difficult than others, so the 1 core that gets the slightly more difficult jobs ends up lagging...
I'm a little confused here, but is this what async parallelization does? I've tried using it before, but because this step is part of a very large processing step, it wasn't clear how to create a barrier to force the program to wait until all async processes are done.
Any advice/links to similar questions/answers are appreciated.
[EDIT] To clarify, the processes are ok to run independently, they all save data to disk and do not share variables.
parmap author here
By default, both in multiprocessing and in parmap, tasks are divided into chunks and chunks are sent to each multiprocessing process (see the multiprocessing documentation). The reason behind this is that sending tasks individually to a process would introduce significant computational overhead in many situations. The overhead is reduced if several tasks are sent at once, in chunks.
The number of tasks in each chunk is controlled with chunksize in multiprocessing (and pm_chunksize in parmap). By default, chunksize is computed as "number of tasks" / (4 * "pool size"), rounded up (see the multiprocessing source code). So for your case, 1000 / (4 * 8) = 31.25 -> 32 tasks per chunk.
If, as in your case, many computationally expensive tasks fall into the same chunk, that chunk will take a long time to finish.
One "cheap and easy" way to workaround this is to pass a smaller chunksize value. Note that using the extreme chunksize=1 may introduce undesired larger cpu overhead.
A proper queuing system, as suggested in other answers, is a better solution in the long term, but maybe overkill for a one-time problem.
You really need to look at creating microservices and using a queue pool. For instance, you could put a list of jobs in celery or redis, and then have the microservices pull from the queue one at a time and process the job. Once done they pull the next item and so forth. That way your load is distributed based on readiness, and not based on a preset list.
http://www.celeryproject.org/
https://www.fullstackpython.com/task-queues.html

Multiple processes with multiple threads in Python

I've heard something about "If you want to get maximum performance from parallel application, you should create as many processes as your computer has CPUs, and in each process -- create some (how many?) threads".
Is it true?
I wrote a piece of code implementing this idiom:
import multiprocessing, threading

number_of_processes = multiprocessing.cpu_count()
number_of_threads_in_process = 25  # some constant

def one_thread():
    # very heavyweight function with lots of CPU/IO/network usage
    do_main_work()

def one_process():
    for _ in range(number_of_threads_in_process):
        t = threading.Thread(target=one_thread, args=())
        t.start()

for _ in range(number_of_processes):
    p = multiprocessing.Process(target=one_process, args=())
    p.start()
Is it correct? Will my do_main_work function really run in parallel, not facing any GIL-issues?
Thank you.
It really depends very much on what you're doing.
Keep in mind that in CPython, only one thread at a time can be executing Python bytecode (because of the GIL). So for a computation-intensive problem in CPython threads won't help you that much.
One way to spread out work that can be done in parallel is to use a multiprocessing.Pool. By default this does not use more processes than your CPU has cores. Using many more processes will mainly have them fighting over resources (CPU, memory) rather than getting useful work done.
But taking advantage of multiple processors requires that you have work for them to do! In other words, if the problem cannot be divided into smaller pieces that can be calculated separately and in parallel, many CPU cores will not be of much use.
Additionally, not all problems are bound by the amount of calculation that has to be done.
The RAM of a computer is much slower than the CPU. If the data-set that you're working on is much bigger than the CPU's caches, reading data from and returning the results to RAM might become the speed limit. This is called memory bound.
And if you are working on much more data than can fit in the machine's memory, your program will be doing a lot of reading and writing from disk. A disk is slow compared to RAM and very slow compared to a CPU, so your program becomes I/O-bound.
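As a concrete illustration of the Pool suggestion above (a minimal sketch; crunch is an invented stand-in for divisible, CPU-bound work):
from multiprocessing import Pool

def crunch(n):
    return sum(i * i for i in range(n))   # stand-in for CPU-bound work

if __name__ == "__main__":
    with Pool() as pool:                   # Pool() defaults to one worker per CPU core
        for result in pool.imap_unordered(crunch, [10_000_000] * 8):
            print(result)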
# very heavyweight function with lots of CPU/IO/network usage
Lots of CPU work will suffer because of the GIL, so you'll only get a benefit from multiple processes.
I/O and network (in fact, network is also a kind of I/O) won't be affected too much by the GIL, because the lock is released explicitly and re-acquired after the I/O operation completes. There are macro definitions in CPython for this:
Py_BEGIN_ALLOW_THREADS
... Do some blocking I/O operation ...
Py_END_ALLOW_THREADS
There will still be a performance hit because the GIL is used in the wrapping code, but you still get better performance with multiple threads.
Finally, and this is a general rule, not only for Python: the optimal number of threads/processes depends on what the program is actually doing. Generally, if it utilizes the CPU intensively, there is almost no performance boost if the number of processes is greater than the number of CPU cores. For example, the Gentoo documentation says that the optimal number of threads for the compiler is CPU cores + 1.
I think the number of threads you are using per process is too high. Usually, for any Intel processor, the number of hardware threads per core is 2. The number of cores varies from 2 (Intel Core i3) to 6 (Intel Core i7). So when all the processes are running, the maximum number of hardware threads will be 6 * 2 = 12.

Can threads switch CPUs?

At my workplace there is a shared, powerful 24-core server on which we run our jobs. To utilize the full power of the multi-core CPU, I wrote a multi-threaded version of a long-running program such that 24 threads run simultaneously, one on each core (via the threading library, in Jython).
The program runs speedily if there are no other jobs running. However, I was running a big job simultaneously on one core, and as a result the thread running on that particular core took a long time, slowing down the entire program (as the threads needed to join the data at the end). The threads on the other CPUs had long finished execution, so I basically had 23 cores idle and 1 core running my thread plus the heavy job, or at least this is my diagnosis. This was further confirmed by looking at the output of the time command: sys time was very low compared to user time (which means there was a lot of waiting).
Does the operating system (Linux in this case) not move jobs to different CPUs if one CPU is loaded while others are idle? If not, can I do that in my program (in Jython)? It should not be difficult to query the CPU loads once in a while and then switch to one that is relatively free.
Thanks.
Source http://www.ibm.com/developerworks/linux/library/l-scheduler/:
To maintain a balanced workload across CPUs, work can be redistributed, taking work from an overloaded CPU and giving it to an underloaded one. The Linux 2.6 scheduler provides this functionality by using load balancing. Every 200 ms, a processor checks to see whether the CPU loads are unbalanced; if they are, the processor performs a cross-CPU balancing of tasks.
A negative aspect of this process is that the new CPU's cache is cold for a migrated task (needing to pull its data into the cache).
Looks like Linux has been balancing threads across cores for a while now.
However, assuming Linux load-balanced instantly (which it doesn't), your problem still reduces to one where you have 23 cores and 24 tasks. In the worst case, where all tasks take equally long, this takes twice as long as having only 23 tasks, because the last task still has to wait for another task to run to completion before there is a free core.
If the wall-clock time of the program suffers by a slowdown of around 2x, this is probably the issue.
If it is drastically worse than 2x, then you may be on an older version of the Linux scheduler.

Python/Redis Multiprocessing

I'm using Pool.map from the multiprocessing library to iterate through a large XML file and save word and ngram counts into a set of three Redis servers (which sit completely in memory). But for some reason all 4 CPU cores sit around 60% idle the whole time. The server has plenty of RAM and iotop shows that there is no disk I/O happening.
I have 4 Python threads and 3 Redis servers running as daemons on three different ports. Each Python thread connects to all three servers.
The number of Redis operations on each server is well below what it has been benchmarked as capable of.
I can't find the bottleneck in this program. What would be the likely candidates?
Network latency may be contributing to your idle CPU time in your python client application. If the network latency between client to server is even as little as 2 milliseconds, and you perform 10,000 redis commands, your application must sit idle for at least 20 seconds, regardless of the speed of any other component.
Using multiple python threads can help, but each thread will still go idle when a blocking command is sent to the server. Unless you have very many threads, they will often synchronize and all block waiting for a response. Because each thread is connecting to all three servers, the chances of this happening are reduced, except when all are blocked waiting for the same server.
Assuming you have uniformly random distributed access across the servers to service your requests (by hashing on key names to implement sharding or partitioning), then the odds that three random requests will hash to the same Redis server are inversely proportional to the number of servers. For 1 server, 100% of the time you will hash to the same server; for 2 it's 50% of the time; for 3 it's 33% of the time. What may be happening is that 1/3 of the time, all of your threads are blocked waiting for the same server. Redis is single-threaded at handling data operations, so it must process each request one after another. Your observation that your CPU only reaches 60% utilization agrees with the probability that your requests are all blocked on network latency to the same server.
Continuing with the assumption that you are implementing client-side sharding by hashing on key names, you can eliminate the contention between threads by assigning each thread a single server connection and evaluating the partitioning hash before passing a request to a worker thread. This ensures the threads are not all waiting on the same server's network latency. But there may be an even better improvement from using pipelining.
You can reduce the impact of network latency by using the pipeline feature of the redis-py module, if you don't need an immediate result from the server. This may be viable for you, since you are storing the results of data processing into Redis, it seems. To implement this using redis-py, periodically obtain a pipeline handle to an existing Redis connection object using the .pipeline() method and invoke multiple store commands against that new handle, the same as you would for the primary redis.Redis connection object. Then invoke .execute() to block on the replies. You can get orders-of-magnitude improvement by using pipelining to batch tens or hundreds of commands together. Your client thread won't block until you issue the final .execute() method on the pipeline handle.
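A hedged sketch of that pipelining idea with redis-py (the host, port, key names, batch size and word_counts data are all invented for illustration):
import redis

r = redis.Redis(host="localhost", port=6379)   # one of your three servers (example values)
word_counts = {"spam": 3, "eggs": 1}           # stand-in for counts from your XML parsing

pipe = r.pipeline()
queued = 0
for word, count in word_counts.items():
    pipe.hincrby("ngrams", word, count)        # queued client-side, nothing is sent yet
    queued += 1
    if queued >= 100:                          # batch roughly 100 commands per round trip
        pipe.execute()                         # one network round trip for the whole batch
        queued = 0
pipe.execute()                                 # flush whatever is left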
If you apply both changes, and each worker thread communicates to just one server, pipelining multiple commands together (at least 5-10 to see a significant result), you may see greater CPU usage in the client (nearer to 100%). The cpython GIL will still limit the client to one core, but it sounds like you are already using other cores for the XML parsing by using the multiprocessing module.
There is a good writeup about pipelining on the redis.io site.
