Threading vs Multiprocessing

Suppose I have a table with 100,000 rows and a Python script that performs some operations on each row of this table sequentially. To speed this up, should I create 10 separate scripts and run them simultaneously, each processing its own block of 10,000 rows, or should I create 10 threads to process the rows?

Threading
Due to the Global Interpreter Lock (GIL), threads in CPython are not truly parallel. In other words, only a single thread can be executing Python bytecode at any given time.
If you are performing CPU-bound tasks, then dividing the workload among threads will not speed up your computations. If anything, it will slow them down because there are more threads for the interpreter to switch between.
Threading is much more useful for I/O-bound tasks, for example when you are communicating with a number of different clients/servers at the same time. In this case the interpreter can switch to another thread while one is waiting for a client/server to respond.
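A minimal sketch of the I/O-bound case, assuming each request is a blocking call that spends its time waiting. The function fake_request and its delays are hypothetical stand-ins for real network calls; time.sleep releases the GIL the same way a blocking socket read would.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(delay):
    """Hypothetical stand-in for a blocking network call."""
    time.sleep(delay)  # the GIL is released while the thread waits
    return delay


def run_sequential(delays):
    """Issue the requests one after another."""
    return [fake_request(delay) for delay in delays]


def run_threaded(delays, max_workers=10):
    """Issue the requests from a pool of threads so the waits overlap."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_request, delays))


if __name__ == "__main__":
    delays = [0.2] * 10
    for runner in (run_sequential, run_threaded):
        start = time.perf_counter()
        runner(delays)
        print(f"{runner.__name__}: {time.perf_counter() - start:.2f}s")
```

The threaded run should take roughly as long as the slowest single request, because the waits overlap. Replacing the sleep with a pure-Python computation would remove that advantage.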
Multiprocessing
As Eman Hamed has pointed out, it can be difficult to share objects while multiprocessing.
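One way to sidestep the sharing problem is to hand each process its own chunk of rows and only send results back. The sketch below assumes the rows fit in memory as a list of tuples and that each row can be processed independently; process_row is a hypothetical placeholder for the real per-row work.

```python
from concurrent.futures import ProcessPoolExecutor


def process_row(row):
    """Hypothetical CPU-bound per-row operation."""
    return sum(value * value for value in row)


def process_chunk(rows):
    """Runs in a worker process; only the chunk and its results are pickled."""
    return [process_row(row) for row in rows]


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous slices."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def process_table(rows, n_workers=4):
    """Process all rows with one chunk per worker, preserving row order."""
    chunks = split_into_chunks(rows, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(process_chunk, chunks)
        return [value for chunk_result in results for value in chunk_result]


if __name__ == "__main__":
    table = [(i, i + 1, i + 2) for i in range(100_000)]
    totals = process_table(table, n_workers=4)
    print(len(totals), totals[:3])
```

Sending whole chunks rather than single rows keeps the pickling overhead per row small, which matters when the per-row work is cheap.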
Vectorization
Libraries like pandas allow you to use vectorized methods on tables. These are highly optimized operations, largely implemented in C, that execute very fast on an entire table or column. Depending on the structure of your table and the operations that you want to perform, you should consider taking advantage of this.

Threads of the same process share a contiguous (virtual) memory block known as the heap; separate processes do not. Threads also consume fewer OS resources than whole processes (separate scripts), and switching between threads is cheaper than switching between processes.
The single biggest performance factor in multithreaded execution, when there are no locks or barriers involved, is data access locality, e.g. in matrix multiplication kernels.
Suppose the data is stored in the heap in a linear fashion, i.e. row 0 in bytes [0-4095], row 1 in bytes [4096-8191], etc. Then thread 0 should operate on rows 0, 10, 20, ..., thread 1 on rows 1, 11, 21, ..., etc.
The main idea is to keep a set of 4 KiB pages in physical RAM and 64-byte blocks (cache lines) in the L3 cache, and to operate on them repeatedly. Computers usually assume that if you use a particular memory location then you are also going to use the adjacent ones, and you should do your best to make that true in your program. The worst case is accessing memory locations that are ~10 MiB apart in a random fashion, so don't do that. E.g. if a single row is 1,310,720 doubles (64 bits each, 10 MiB in total), then your threads should operate in an intra-row (within a single row) rather than an inter-row (as above) fashion.
Benchmark your code. If your algorithm processes rows at around 21.3 GB/s (the peak bandwidth of a single DDR3-2666 channel), then you have a memory-bound task. If it processes only around 1 GiB/s, then you have a compute-bound task, meaning that executing instructions on the data takes more time than fetching the data from RAM, and you need to either optimize your code, reach a higher IPC by using AVX instruction sets, or buy a newer processor with more cores or a higher frequency.

Related

Basic Mapreduce with threads is slower than sequential version

I am trying to do a word counter with mapreduce using threads, but this version is much slower than the sequential version. With a 300MB text file the mapreduce version takes about 80s, with the sequential version it takes significantly less. My question is due to not understanding why, as I have done all the stages of map reduce (split, mapping, shuffle, reduce) but I can't figure out why it is slower, as I have used about 6 threads to do the test. I was thinking that it could be that the creation of threads was expensive compared to the execution time, but since it takes about 80s I think it is clear that this is not the problem. Could you take a look at the code to see what it is? I'm pretty sure that the code works fine, the problem is that I don't know what is causing the slowness.
One last thing, when using a text file of more than 300MB, the program fills all the ram memory of my computer, is there any way to optimize it?

First of all, several disclaimers:
To know the exact reason why the application is slow you need to profile it. In this answer I'm only giving some common-sense reasoning.
I'm assuming you are using CPython.
When you parallelize an algorithm, several factors influence performance. Some of them work in favour of speed (I'll mark them with +) and some against it (-). Let's look at them:
1. you need to split the work first (-)
2. the work in the parallel workers is done simultaneously (+)
3. parallel workers may need to synchronize their work (-)
4. the reduce step requires time (-)
For your parallel algorithm to give you a gain over the sequential one, the factors that speed things up must outweigh all the factors that slow things down.
Also, the gain from #2 should be big enough to compensate for the additional work you need to do compared to sequential processing (this means that for small inputs you will not get any boost, as the coordination overhead will be bigger than the gain).
The main problems in your implementation now are items #2 and #3.
First of all, the workers are not working in parallel. The portion of the task you parallelize is CPU-bound. In CPython, the threads of a single process cannot execute Python bytecode on more than one CPU at a time because of the Global Interpreter Lock. So in this program they never execute in parallel; they effectively share a single CPU.
Moreover, every modification the threads make to the shared dicts involves locking/unlocking, and this is much slower than the sequential version, which does not require such synchronization.
To speed up your algorithm you need to:
use multiprocessing instead of multithreading (this way you can use multiple CPUs for processing)
structure the algorithm so that it does not require synchronization between workers while they do their job (each worker should use its own dict to store intermediate results)
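The two recommendations above can be sketched as follows. This is an illustration under assumptions, not the asker's code: the helper names (count_words, merge_counts, chunked) and the chunk size are hypothetical, and the input is assumed to be a list of lines already in memory.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(lines):
    """Map step: each worker fills its own Counter, so no locking is needed."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(partial_counts):
    """Reduce step: runs once in the parent after the workers finish."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def chunked(lines, chunk_size):
    """Yield consecutive slices of at most chunk_size lines."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over the lazy dog"] * 50_000
    with Pool(processes=4) as pool:
        partials = pool.map(count_words, list(chunked(lines, 25_000)))
    print(merge_counts(partials).most_common(3))
```

Workers never touch shared state; the only synchronization is the single merge at the end. For files too large for RAM, the same structure works if the chunks are read lazily from the file instead of from a list.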

Does it make sense to multi-thread within multiprocessing?

With Python's multiprocessing, would it make sense to have a Pool with a bunch of ThreadPools within them? Say I have something like:
def task(path):
    # i/o bound
    image = load(path)
    # cpu bound but only takes up 1/10 of the time of the i/o bound stuff
    image = preprocess(image)
    # i/o bound
    save(image, path)
Then I'd want to process a list of paths path_list. If I use ThreadPool I still end up hitting a ceiling because of the cpu bound bit. If I use a Pool I spend too much dead time waiting for i/o. So wouldn't it be best to split path_list over multiple processes that each in turn use multiple threads?
Another shorter way of restating my example is what if I have a method that should be multithreaded because it's i/o bound but I also want to make use of many cpu cores? If I use a Pool I'm using each core up for a single task which is i/o bound. If I use a ThreadPool I only get to use one core.

Does it make sense
Yes. Let's say you start with one process and one thread. Because some parts of the code block on IO, the process will utilize less than 100% of a CPU core, so we start adding threads. As long as adding threads increases task throughput, the process was not yet using all the CPU time available to it. At some point, we might hit 100% CPU utilization in our process. Because of the GIL, a pure-Python process can utilize at most 100% of a single core. But, as far as we know, the CPU might still be our bottleneck, and the only way to gain more CPU time is to create another process (or use subinterpreters, but let's ignore that for now).
In summary, this is a valid approach for increasing throughput of pure-python tasks that both utilize CPU and block on IO. But, it does not mean that it is a good approach in your case. First, your bottleneck might be the disk and not the CPU, in which case you don't need more CPU time, which means you don't need more processes. Second, even if the CPU is the bottleneck, multithreading within multiprocessing is not necessarily the simplest solution, the most performant solution, or the winning solution in other resource utilization metrics such as memory usage.
For example, if simplicity is your top priority, you could get all the CPU time you need just by using processes. This solution is easier to implement, but is heavy in terms of memory usage. Or, for example, if your goal is to achieve maximal performance and minimal memory utilization, then you probably want to replace the threads with an IO loop and use a process pool executor for your CPU-bound tasks. Squeezing maximal performance from your hardware is not an easy task. Below is a methodology that I feel has served me well.
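For reference, the "threads inside processes" layout from the question could look roughly like this. It is a sketch under assumptions: task here is a hypothetical stand-in that sleeps instead of doing real disk I/O, and the pool sizes are arbitrary.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def task(path):
    """Hypothetical stand-in for load -> preprocess -> save."""
    time.sleep(0.01)                        # pretend to load (I/O-bound)
    checksum = sum(ord(ch) for ch in path)  # pretend to preprocess (CPU-bound)
    time.sleep(0.01)                        # pretend to save (I/O-bound)
    return path, checksum


def run_batch(paths, threads_per_process=8):
    """Runs inside one worker process and overlaps I/O waits with threads."""
    with ThreadPoolExecutor(max_workers=threads_per_process) as threads:
        return list(threads.map(task, paths))


def run_all(path_list, n_processes=4):
    """Split the paths across processes; each process uses its own thread pool."""
    batches = [path_list[i::n_processes] for i in range(n_processes)]
    results = []
    with ProcessPoolExecutor(max_workers=n_processes) as processes:
        for batch_result in processes.map(run_batch, batches):
            results.extend(batch_result)
    return results


if __name__ == "__main__":
    paths = [f"image_{i}.png" for i in range(64)]
    print(len(run_all(paths)))
```

Because the batches are taken with a stride, the results come back grouped by batch rather than in the original order of path_list. Whether this layout beats a plain process pool or a plain thread pool is something only measurement on the real workload can tell.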
Aiming towards maximal performance
From now on, I'm assuming your goal is to make maximal use of your hardware in order to achieve a maximal throughput of "tasks". In that case, the final solution depends on your hardware, so you'll need to get to know it a little bit better. To try and reach your performance goals, I recommend that you:
Understand your hardware utilization
Identify the bottleneck and estimate the maximal throughput
Design a solution to achieve that throughput
Implement the design, and optimize until you meet your requirements
In detail:
1. Understand your hardware utilization
In this case, there are a few pieces of hardware involved:
The RAM
The disk
The CPU
Let's look at one "task" and note how it uses the hardware:
Disk (read)
RAM (write)
CPU time
RAM (read)
Disk (write)
2. Identify the bottleneck and estimate the maximal throughput
To identify the bottleneck, let us calculate the maximum throughput of tasks that each hardware component can provide, assuming usage of them can be completely parallelized. I like to do that using python:
(note that I'm using random constants, you'll have to fill in the real data for your setup in order to use it).
# ----------- General consts
input_image_size = 20 * 2 ** 20 # 20MB
output_image_size = 15 * 2 ** 20 # 15MB
# ----------- Disk
# If you have multiple disks and disk access is the bottleneck, you could split the images between them
amount_of_disks = 2
disk_read_rate = 3.5 * 2 ** 30 # 3.5GBps, maximum read rate for a good SSD
disk_write_rate = 2.5 * 2 ** 30 # 2.5GBps, maximum write rate for a good SSD
disk_read_throughput = amount_of_disks * disk_read_rate / input_image_size
disk_write_throughput = amount_of_disks * disk_write_rate / output_image_size
# ----------- RAM
ram_bandwidth = 30 * 2 ** 30 # Assuming here similar write and read rates of 30GBps
# assuming you are working in userspace and not using a userspace filesystem,
# data is first read into kernel space, then copied to userspace. So in total,
# two writes and one read.
userspace_ram_bandwidth = ram_bandwidth / 3
ram_read_throughput = userspace_ram_bandwidth / input_image_size
ram_write_throughput = userspace_ram_bandwidth / output_image_size
# ----------- CPU
# We decrease one core, as at least some scheduling code and kernel code is going to run
core_amount = 8 - 1
# The measured amount of times a single core can run the preprocess function in a second.
# Assuming that you are not planning to optimize the preprocess function as well.
preprocess_function_rate = 1000
cpu_throughput = core_amount * preprocess_function_rate
# ----------- Conclusions
min_throughput, bottleneck_name = min([(disk_read_throughput, 'Disk read'),
                                       (disk_write_throughput, 'Disk write'),
                                       (ram_read_throughput, 'RAM read'),
                                       (ram_write_throughput, 'RAM write'),
                                       (cpu_throughput, 'CPU')])
cpu_cores_needed = min_throughput / preprocess_function_rate
print(f'Throughput: {min_throughput:.1f} tasks per second\n'
      f'Bottleneck: {bottleneck_name}\n'
      f'Worker amount: {cpu_cores_needed:.1f}')
This code outputs:
Throughput: 341.3 tasks per second
Bottleneck: Disk write
Worker amount: 0.3
That means:
The maximum rate we can achieve is around 341.3 tasks per second
The disk is the bottleneck. You might be able to increase your performance by, for example:
Buying more disks
Using ramfs or a similar solution that avoids using the disk altogether
In a system where all the steps of a task are executed in parallel, you won't need to dedicate more than one core to running preprocess. (In Python that means you'll probably need only one process, and threads or asyncio would suffice to achieve concurrency with the other steps.)
Note: the numbers are lying
This kind of estimation is very hard to get right. It's hard not to forget things in the calculation itself, and hard to achieve good measurements for the constants. For example, there is a big issue with the current calculation - reads and writes are not orthogonal. We assume in our calculation that everything is happening in parallel, so constants like disk_read_rate have to account for writes occurring simultaneously to the reads. The RAM rates should probably be decreased by at least 50%.
3. Design a solution to achieve that throughput
Similarly to what you proposed in your question, my initial design would be something like:
Have a pool of workers load the images and send them on a queue to the next step (we'll need to read using multiple cores to use all the available memory bandwidth)
Have a pool of workers process the images and send the results on a queue (the number of workers should be chosen according to the output of the script above; for the current result, the number is 1)
Have a pool of workers save the processed images to the disk.
The actual implementation details will vary according to different technical constraints and overheads you will run into while implementing the solution. Without further details and measurements it is hard to guess what they will be exactly.
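As a starting point, the three-stage design can be sketched with threads and bounded queues. This is a simplification under assumptions: it uses a single worker thread per stage instead of a pool, the load, preprocess and save callables are supplied by the caller, and there is no error handling. The names STOP, stage and run_pipeline are made up for this illustration.

```python
import queue
import threading

STOP = object()  # sentinel telling a stage that no more items will arrive


def stage(worker, inbox, outbox):
    """Pull items from inbox, apply worker, and push the results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            break
        outbox.put(worker(item))


def run_pipeline(paths, load, preprocess, save, queue_size=16):
    """Run load -> preprocess -> save as three concurrent stages."""
    to_load = queue.Queue(maxsize=queue_size)
    to_process = queue.Queue(maxsize=queue_size)
    to_save = queue.Queue(maxsize=queue_size)
    done = queue.Queue()  # unbounded; drained by the caller below

    def feed():
        for path in paths:
            to_load.put(path)  # blocks when the first stage falls behind
        to_load.put(STOP)

    threads = [
        threading.Thread(target=feed),
        threading.Thread(target=stage, args=(load, to_load, to_process)),
        threading.Thread(target=stage, args=(preprocess, to_process, to_save)),
        threading.Thread(target=stage, args=(save, to_save, done)),
    ]
    for thread in threads:
        thread.start()

    results = []
    while (item := done.get()) is not STOP:
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

The bounded queues provide back-pressure, so a fast loader cannot fill the RAM with images waiting to be processed. With several workers per stage, each stage would need one sentinel per worker, and a CPU-heavy preprocess stage would have to run in separate processes to get past the GIL.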
4. Implement the design, and optimize until you meet your requirements
Good luck, and be warned that even if you did a good job at estimating the maximal throughput, it might be very hard to get there. Comparing the maximum rate to your speed requirements might give you a good idea of the amount of effort needed. For example, if the rate you need is 10x slower than the maximum rate, you might be done pretty quickly. But if it is only 2x slower, you might want to consider doubling your hardware and start preparing for some hard work :)

kmarok's answer is a good technical one. But I would also keep in mind the saying "Premature optimization is the root of all evil".
In short: yes, it makes sense. But do you really need to?
Optimization is a trade-off. You compromise code simplicity for better performance. Code simplicity is important; you'll need to further develop, debug, and test your software in the future. This will cost you time. Simplicity buys you time. You need to be aware of the trade-off when you optimize.
I would first write a multithreaded version and measure it on your hardware.
Then I would try the multiprocessing version, and measure it too.
Is either version good enough? It might be. If so, you have just kept your software simpler, more readable and more maintainable.

Chen's and Kamaork's answers summarize most of what you need to know, but there are two missing ideas:
Your code will be A process and not THE process. This means that you need to account for how many resources you have left, not how many the machine has in total (this applies even within your process; threads are not unlimited). This deadly problem happened to me, leaving me with less than half of a Celeron for a GUI. Not good.
The biggest optimization you can do with threads is "prediction" (this refers specifically to when things happen). You can chain the threads more effectively when you know how long each step takes to compute and the wait is consistent. Reading about the TCP window may give you a better idea of how a delay can be optimized by expecting it rather than by forcing it.

How can one use multiprocessing to run a function faster?

I am calling a function and saving the two outputs to variables but this process is taking time because the outputs are generated by solving an ODE.
Is it possible to use multiple cores to run the function faster so the values are saved sooner? If so, could someone provide a simple example?
Thank you

Simply running the same code on multiple cores will not make it run faster. It really depends on the type of tasks you're doing. Here are some questions you need to answer before you can decide whether the code will benefit from parallel processing:
Are the steps in your computation sequence-dependent? In other words, does one part of the code depend on the calculations done in the previous part? Or can some of them be calculated in parallel? Look at Amdahl's law to learn how much speedup to expect based on how much of your code you can parallelize.
Does your code involve lots of reading/writing to disk and memory? Or is it just lots of computation? If you are doing significant reads and writes to disk, then creating multiple processes or threads that do other work while one waits for the disk can result in significant speedups. But again, this depends on your answer to the previous point about sequence dependency.
How long does your code currently take to run? And is the overhead of creating multiple processes going to be more than the time it takes to run sequentially? In your question you don't give specific times. If you're talking about speeding up a task that takes a few seconds, then the time required to create multiple processes might be significant compared to the total time for the task. But if you're talking about a task that takes minutes, then the overhead won't be as significant.
Have you considered whether your code is data parallel or task parallel? Depending on the answer, you can decide whether to parallelize on the CPU or the GPU. For large mathematical operations, look at NumPy for CPU-based and CuPy for GPU-based operations.
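To make the Amdahl's law point concrete, here is a small helper that computes the theoretical speedup. The function name is made up for this illustration, and the formula ignores process start-up and communication overhead, so real speedups will be lower.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the runtime can be parallelized.

    parallel_fraction: share of the sequential runtime that can run in parallel (0..1)
    workers: number of cores/processes working on the parallel part
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(amdahl_speedup(fraction, workers=8), 2))
```

For example, if half of the runtime is inherently sequential, four workers give at most a 1.6x speedup, and no number of workers can ever reach 2x.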

faster numpy array copy; multi-threaded memcpy?

Suppose we have two large numpy arrays of the same data type and shape, of size on the order of GB's. What is the fastest way to copy all the values from one into the other?
When I do this using normal notation, e.g. A[:] = B, I see exactly one core on the computer at maximum effort doing the copy for several seconds, while the others are idle. When I launch multiple workers using multiprocessing and have them each copy a distinct slice into the destination array, such that all the data is copied, using multiple workers is faster. This is true regardless of whether the destination array is a shared memory array or one that becomes local to the worker. I can get a 5-10x speedup in some tests on a machine with many cores. As I add more workers, the speed does eventually level off and even slow down, so I think this achieves being memory-performance bound.
I'm not suggesting using multiprocessing for this problem; it was merely to demonstrate the possibility of better hardware utilization.
Does there exist a python interface to some multi-threaded C/C++ memcpy tool?
Update (03 May 2017)
When it is possible, using multiple python processes to move data can give major speedup. I have a scenario in which I already have several small shared memory buffers getting written to by worker processes. Whenever one fills up, the master process collects this data and copies it into a master buffer. But it is much faster to have the master only select the location in the master buffer, and assign a recording worker to actually do the copying (from a large set of recording processes standing by). On my particular computer, several GB can be moved in a small fraction of a second by concurrent workers, as opposed to several seconds by a single process.
Still, this sort of setup is not always (or even usually?) possible, so it would be great to have a single python process able to drop into a multi-threaded memcpy routine...

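If you are certain that the types and memory layouts of both arrays are identical, this might give you a speedup: memoryview(A)[:] = memoryview(B). This should use memcpy directly and skip the checks for NumPy broadcasting and type conversion rules. Note that memoryview slice assignment is restricted to one-dimensional views, so multidimensional arrays would first have to be exposed as flat buffers.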
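The buffer-level copy can be tried out without NumPy. In the sketch below, array.array objects from the standard library stand in for the NumPy arrays (an assumption made only to keep the example dependency-free), and copy_buffer is a hypothetical helper name.

```python
from array import array


def copy_buffer(dst, src):
    """Copy src into dst through the buffer protocol.

    Both objects must expose one-dimensional buffers with the same item
    format and length; otherwise memoryview raises ValueError.
    """
    memoryview(dst)[:] = memoryview(src)


if __name__ == "__main__":
    source = array("d", range(1_000_000))
    target = array("d", bytes(8 * len(source)))  # zero-filled, same size
    copy_buffer(target, source)
    print(target == source)
```

This copy still runs on a single core, so it does not address the multi-threaded part of the question; it only removes per-copy checking overhead.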

Creating a thread for each operation or some threads for various operations?

For a class project I am writing a simple matrix multiplier in Python. My professor has asked for it to be threaded. The way I handle this right now is to create a thread for every row and throw the result in another matrix.
What I want to know is whether it would be faster if, instead of creating a thread for each row, the program created a fixed number of threads that each handle several rows.
For example: given Matrix1 100x100 * Matrix2 100x100 (matrix sizes can vary widely):
4 threads each handling 25 rows
10 threads each handling 10 rows
Maybe this is a problem of fine tuning or maybe the thread creation process overhead is still faster than the above distribution mechanism.

You will probably get the best performance if you use one thread for each CPU core available to the machine running your application. You won't get any performance benefit by running more threads than you have processors.
If you are planning to spawn new threads each time you perform a matrix multiplication then there is very little hope of your multi-threaded app ever outperforming the single-threaded version unless you are multiplying really huge matrices. The overhead involved in thread creation is just too high relative to the time required to multiply matrices. However, you could get a significant performance boost if you spawn all the worker threads once when your process starts and then reuse them over and over again to perform many matrix multiplications.
For each pair of matrices you want to multiply you will want to load the multiplicand and multiplier matrices into memory once and then allow all of your worker threads to access the memory simultaneously. This should be safe because those matrices will not be changing during the multiplication.
You should also be able to allow all the worker threads to write their output simultaneously into the same output matrix because (due to the nature of matrix multiplication) each thread will end up writing its output to different elements of the matrix and there will not be any contention.
I think you should distribute the rows between threads by maintaining an integer NextRowToProcess that is shared by all of the threads. Whenever a thread is ready to process another row it calls InterlockedIncrement (or whatever atomic increment operation you have available on your platform) to safely get the next row to process.
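In Python, the shared NextRowToProcess counter could be sketched like this. Python has no InterlockedIncrement, so a threading.Lock guards the counter instead; the function name and the list-of-lists matrix representation are assumptions for this illustration, and b is assumed to be non-empty.

```python
import threading


def multiply_threaded(a, b, n_threads=4):
    """Multiply matrices a and b, letting threads claim rows from a shared counter."""
    n_rows, n_inner, n_cols = len(a), len(b), len(b[0])
    result = [[0] * n_cols for _ in range(n_rows)]
    next_row = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_row
        while True:
            with lock:  # atomically claim the next unprocessed row
                row = next_row
                next_row += 1
            if row >= n_rows:
                return
            a_row = a[row]
            # Each thread writes a different row, so the writes never collide.
            result[row] = [
                sum(a_row[k] * b[k][col] for k in range(n_inner))
                for col in range(n_cols)
            ]

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result


if __name__ == "__main__":
    print(multiply_threaded([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

This only illustrates the work distribution scheme. The sketch starts its threads on every call, whereas the advice above is to create them once and reuse them, and under CPython's GIL the pure-Python arithmetic will not run in parallel in either case.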

A CPU-bound task will never be faster in multi-threaded mode in Python. Due to the Global Interpreter Lock, only one thread can execute Python code at a time (unless you write a C extension and release the lock explicitly).
This applies to the standard CPython implementation as well as PyPy. In Jython, try to use one thread per core; more does not make sense.
Please also check out the great GIL overview by David Beazley.
On the other hand, if your professor does not mind, you can use multiprocessing.
