I am calling a function and saving its two outputs to variables, but this process is taking a long time because the outputs are generated by solving an ODE.
Is it possible to use multiple cores to run the function faster so the values are saved sooner? If so, could someone provide a simple example?
Thank you
Simply running the same code on multiple cores will not make it run faster. It really depends on the type of tasks you're doing. Here are some questions you need to answer before you can decide whether the code will benefit from parallel processing:
Are the steps in your computation sequence-dependent? In other words, does one part of the code depend on calculations done in a previous part, or can some parts be computed in parallel? Look at Amdahl's law to learn how much speedup to expect based on how much of your code you can parallelize (see the sketch after this list).
Does your code involve lots of reading/writing to disk and memory, or is it mostly computation? If you are doing significant reads and writes to disk, then creating multiple processes to do other work while your threads wait on disk can result in significant speedups. But again, this depends on your answer to the previous point about sequence dependency.
How long does your code currently take to run, and is the overhead of creating multiple processes going to be more than the time it takes to run sequentially? You don't give specific times in your question: if you're talking about speeding up a task that takes a few seconds, the time required to spawn multiple processes may be significant compared to the total task time. But if you're talking about a task that takes minutes, the overhead won't be as significant.
Have you considered whether your code is data-parallel or task-parallel? If so, you can decide whether to parallelize on the CPU or the GPU. For large mathematical operations, look at NumPy for CPU-based and CuPy for GPU-based operations.
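To make the Amdahl's law point concrete, here is a minimal sketch (plain Python, no dependencies; the fractions and core count are illustrative) of the formula speedup = 1 / ((1 - p) + p/n), where p is the fraction of the runtime that can be parallelized and n is the number of cores:

    def amdahl_speedup(p, n):
        """Best-case speedup for a workload where a fraction p of the
        runtime is perfectly parallelizable across n cores."""
        return 1.0 / ((1.0 - p) + p / n)

    # Even with 16 cores, a program that is only 50% parallelizable
    # can never run more than ~1.9x faster.
    for p in (0.5, 0.9, 0.99):
        print(f"p={p}: {amdahl_speedup(p, 16):.2f}x on 16 cores")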
I recently started using Ray's ActorPool to parallelize my Python code on my local machine (using the code below), and it's definitely working. Specifically, I used it to process a list of arguments and return a list of results. (Note that depending on the inputs, the process function can take different amounts of time.)
However, while testing the script, it seems the processes are somewhat "blocking" each other: if one process takes a long time, the other cores seem to stay more or less idle. It's definitely not completely blocking, since running it this way still saves a lot of time compared to a single core, but I found that many of the cores stay idle (more than half with <20% usage) even though I'm running the script on all 16 cores. This is especially noticeable when there is one long-running task, in which case only one or two cores are actually active. Also, the total time saved is nowhere near 16x.
    from ray.util import ActorPool
    from tqdm import tqdm

    pool = ActorPool(actors)
    poolmap = pool.map(
        lambda a, v: a.process.remote(v),  # submit one argument per actor
        args,
    )
    result_list = [r for r in tqdm(poolmap, total=length)]
I suspect the way I collect the result values (last line) is not optimal, but I'm not sure how to make it better. Could you help me improve it?
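One likely improvement (a sketch, not a tested fix): as far as I can tell, ActorPool.map yields results in submission order and only feeds new work to the pool as you consume results, so a slow task at the head of the queue can leave actors that have already finished sitting idle. If you don't need the results in input order, map_unordered yields each result as soon as it completes:

    from ray.util import ActorPool
    from tqdm import tqdm

    pool = ActorPool(actors)
    poolmap = pool.map_unordered(          # yield results as they finish,
        lambda a, v: a.process.remote(v),  # not in submission order
        args,
    )
    result_list = [r for r in tqdm(poolmap, total=length)]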
I am trying to write a word counter with MapReduce using threads, but this version is much slower than the sequential version. With a 300MB text file the MapReduce version takes about 80s, while the sequential version takes significantly less. My question is that I don't understand why: I have implemented all the stages of MapReduce (split, map, shuffle, reduce) and used about 6 threads for the test, but I can't figure out why it is slower. I thought the creation of threads might be expensive compared to the execution time, but since the run takes about 80s, I think it's clear that this is not the problem. Could you take a look at the code and see what it is? I'm pretty sure the code works correctly; the problem is that I don't know what is causing the slowness.
One last thing: when using a text file larger than 300MB, the program fills all the RAM on my computer. Is there any way to optimize it?
First of all, several disclaimers:
to know the exact reason why the application is slow, you need to profile it. In this answer I'm giving some common-sense reasoning.
I'm assuming you are using CPython
When you parallelize an algorithm, there are several factors that influence performance. Some of them work in favour of speed (I'll mark them with +) and some against (-). Let's look at them:
1. you need to split the work first (-)
2. the work done by parallel workers happens simultaneously (+)
3. parallel workers may need to synchronize their work (-)
4. the reduce step requires time (-)
In order for your parallel algorithm to give you some gain compared to the sequential one, all the factors that speed things up must outweigh all the factors that drag you down.
Also, the gain from #2 should be big enough to compensate for the additional work you need to do compared to sequential processing (this means that for small data you will not get any boost, as the coordination overhead will dominate).
The main problems in your implementation now are items #2 and #3.
First of all, the workers are not actually working in parallel. The portion of the task you parallelize is CPU-bound, and in Python, threads of a single process cannot use more than one CPU. So in this program they never execute in parallel; they share the same CPU.
Moreover, every modification they make to the shared dicts requires locking/unlocking, and this is much slower than the sequential version, which does not require such synchronization.
To speed up your algorithm you need to:
use multiprocessing instead of multithreading (this way you can use multiple CPUs for the processing)
structure the algorithm so that it does not require synchronization between workers while they do their job (each worker should use its own dict to store intermediate results; see the sketch below)
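To illustrate both points, here is a minimal sketch of a word counter along these lines (the splitting is naive and the file name is a placeholder): each worker counts into its own private Counter in a separate process, and the per-worker results are merged afterwards in the parent:

    from collections import Counter
    from functools import reduce
    from multiprocessing import Pool

    def count_words(chunk):
        # Map step: each worker counts into its own private Counter
        # (no shared dict, no locking).
        return Counter(chunk.split())

    def parallel_word_count(text, n_workers=6):
        # Split step: cut the text into roughly equal pieces.
        # (A real splitter should cut on whitespace boundaries so
        # words are not sliced in half.)
        size = len(text) // n_workers + 1
        chunks = [text[i:i + size] for i in range(0, len(text), size)]
        # Map in separate processes, each on its own CPU.
        with Pool(n_workers) as pool:
            partial_counts = pool.map(count_words, chunks)
        # Reduce step: merge the per-worker counters in the parent.
        return reduce(lambda a, b: a + b, partial_counts, Counter())

    if __name__ == "__main__":
        with open("input.txt") as f:   # placeholder file name
            print(parallel_word_count(f.read()).most_common(10))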
Suppose I have a table with 100,000 rows and a Python script which performs some operations on each row of this table sequentially. To speed this up, should I create 10 separate scripts and run them simultaneously, each processing a consecutive block of 10,000 rows, or should I create 10 threads to process the rows?
Threading
Due to the Global Interpreter Lock, Python threads are not truly parallel. In other words, only a single thread can be running at a time.
If you are performing CPU-bound tasks, then dividing the workload amongst threads will not speed up your computations. If anything, it will slow them down, because there are more threads for the interpreter to switch between.
Threading is much more useful for IO-bound tasks, for example if you are communicating with a number of different clients/servers at the same time. In this case you can switch between threads while you are waiting for different clients/servers to respond.
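As an illustration of the IO-bound case, here is a minimal sketch using a thread pool to fetch several URLs concurrently (the URLs are placeholders); while one thread is blocked on the network, the others can run:

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    # Placeholder URLs: substitute the servers you actually talk to.
    urls = ["https://example.com", "https://example.org", "https://example.net"]

    def fetch(url):
        # The thread spends most of its time blocked on the network,
        # which is exactly when the GIL is released.
        with urlopen(url, timeout=10) as resp:
            return url, len(resp.read())

    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, size in pool.map(fetch, urls):
            print(f"{url}: {size} bytes")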
Multiprocessing
As Eman Hamed has pointed out, it can be difficult to share objects between processes when using multiprocessing.
Vectorization
Libraries like pandas allow you to use vectorized methods on tables. These are highly optimized operations written in C that execute very fast on an entire table or column. Depending on the structure of your table and the operations that you want to perform, you should consider taking advantage of this.
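For example, here is a toy comparison (the column names and sizes are invented for illustration): a row-by-row Python loop versus the equivalent vectorized pandas operation, which does the same arithmetic in one call into optimized C code:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"price": np.random.rand(100_000),
                       "qty": np.random.randint(1, 10, 100_000)})

    # Row-by-row: every iteration goes through the interpreter.
    totals = [row.price * row.qty for row in df.itertuples()]

    # Vectorized: one pass over the whole column in C.
    df["total"] = df["price"] * df["qty"]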
Threads of a process share a contiguous (virtual) memory block known as the heap; separate processes (separate scripts) do not. Threads also consume fewer OS resources than whole processes, and switching between threads is cheaper than switching between processes.
When there is no locking or barriers involved, the single biggest performance factor in multithreaded execution is data access locality, e.g. in matrix multiplication kernels.
Suppose data is stored in the heap linearly, i.e. the 0th row in bytes [0-4095], the 1st row in bytes [4096-8191], etc. Then thread-0 might be assigned rows 0, 10, 20, ..., thread-1 rows 1, 11, 21, ..., and so on.
The main idea is to keep a working set of 4K pages in physical RAM and 64-byte blocks in L3 cache and operate on them repeatedly. CPUs assume that if you access a particular memory location you are also going to use adjacent ones, and you should do your best to exploit this in your program. The worst-case scenario is accessing memory locations that are ~10MiB apart in a random fashion, so don't do that. E.g. if a single row is 1,310,720 64-bit doubles (10MiB) in size, your threads should operate within a row (intra-row) rather than across rows (inter-row, as in the scheme above).
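A quick way to feel this effect is to time row-major versus column-major traversal of a C-ordered NumPy array (the array size is arbitrary); the strided, cache-unfriendly order is typically several times slower:

    import time
    import numpy as np

    a = np.random.rand(4096, 4096)  # C order: rows are contiguous

    def traverse(arr, by_rows):
        s = 0.0
        if by_rows:                  # walk contiguous memory
            for i in range(arr.shape[0]):
                s += arr[i, :].sum()
        else:                        # stride across rows: cache-hostile
            for j in range(arr.shape[1]):
                s += arr[:, j].sum()
        return s

    for by_rows in (True, False):
        t0 = time.perf_counter()
        traverse(a, by_rows)
        label = "row-major" if by_rows else "column-major"
        print(f"{label}: {time.perf_counter() - t0:.2f}s")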
Benchmark your code. Depending on your results, if your algorithm can process around 21.3GiB/s of rows (roughly the bandwidth of DDR4-2666), you have a memory-bound task. If your code processes more like 1GiB/s, you have a compute-bound task, meaning executing instructions on the data takes more time than fetching it from RAM. You then need to either optimize your code, reach higher IPC by utilizing AVX instruction sets, or buy a newer processor with more cores or a higher frequency.
This is probably a stupid question, but if I have a simple function and I want to run it, say, 100 times, and I have 12 processors available, is it better to run the multiprocessing code on 10 processors or 12?
Basically, by using 12 cores will I save one round of iterations? Or will it run 10 iterations the first time, then 2, then 10 again, and so on?
It's almost always better to use the number of processors available. However, some algorithms need processes to communicate partial results to achieve an end result (many image-processing algorithms have this constraint). Those algorithms have a limit on the number of processes that should run in parallel, because beyond this limit the cost of communication impairs performance.
However, it depends on a lot of things. Many algorithms are easily parallelizable, yet the cost of parallelism impairs their acceleration. Basically, for parallelism to be worth anything, the actual work to be done must be an order of magnitude greater than the cost of the parallelism.
In typical multi-threaded languages, you can easily reduce the cost of parallelism by re-using the same threads (thread pooling). However, Python being Python, you must use multiprocessing to achieve true parallelism, which has a large cost. There is, however, a process pool (multiprocessing.Pool) if you wish to re-use processes.
You need to check how much time it takes to run your algorithm sequentially, how much time one iteration takes, and how many iterations you will have. Only then will you know whether parallelization is worth it. If it is worth it, then run tests with the number of processes going from 1 to 100. This will allow you to find the sweet spot for your algorithm.
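A sketch of such a sweep, with a stand-in work function in place of the real task: it times a multiprocessing.Pool at several pool sizes so you can see where adding processes stops paying off:

    import time
    from multiprocessing import Pool

    def work(x):
        # Stand-in for the real function; replace with your own.
        return sum(i * i for i in range(50_000)) + x

    if __name__ == "__main__":
        tasks = list(range(100))
        for n_procs in (1, 2, 4, 8, 10, 12):
            t0 = time.perf_counter()
            with Pool(n_procs) as pool:
                pool.map(work, tasks)
            print(f"{n_procs:2d} processes: {time.perf_counter() - t0:.2f}s")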
I wrote a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model 1 for doing machine translation (here is my GitHub if you want to look at the code), and it works, but really, really slowly. I'm taking a class in parallel computing now and I was wondering if I could use Python multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you write code (such as line 82) that performs a numerical computation one element at a time, you pay for that one computation plus all the overhead of the Python interpreter.
The first thing you will want to do is vectorize your code with NumPy. Unlike your normal Python code, NumPy calls out to precompiled, efficient binary code. The more work you can push into NumPy, the less time you will waste in the interpreter.
Once you vectorize your code, you can then start profiling it if it's still too slow. You should be able to find a lot of simple examples of how to vectorize Python, as well as some of the alternative options.
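To make the vectorization point concrete, a toy before/after (array size arbitrary); the second version does the same arithmetic in one NumPy call:

    import numpy as np

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)

    # Element-at-a-time: every iteration pays interpreter overhead.
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] * y[i] + 1.0

    # Vectorized: one pass in precompiled C, typically orders of
    # magnitude faster.
    out = x * y + 1.0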
EDIT: Let me clarify that parallelizing inherently slow code is mostly pointless. First, parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be measured against the fastest possible single-threaded version of the same code (within reason; no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention: the more threads fighting for the lock, the slower the code runs, and you get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for it. This makes it appear as if there is no overhead from lock contention, when in actuality you have made no improvement, because the fastest single-threaded version of your code will outperform your parallel code.
Also, Python really isn't a great language for learning how to write parallel code. Python has the GIL, which essentially forces all multithreaded Python code to run as if there were but one CPU core. This means bizarre hacks (such as the one you linked) must be used, and they have their own additional drawbacks and issues (there are times where such tricks are needed/used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing parallel Python code to carry over to other languages or help you with your course.
I think you will have some good success depending on where your bottleneck is. One caveat: when I do code optimization I always like to profile the code, even informally, to get an idea of where the bottlenecks are. This will help identify where the time is being spent, i.e. file I/O, network latency, resource contention, not enough CPU cycles, etc.
For others who may not be familiar with the Expectation Maximization algorithm, a very nice introduction is Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: computing the distance between data points and models, and updating the model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
    for current_model in N:
        compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares, solving for the model parameters

This requires solving N weighted least squares problems, where the size of each depends on the number of parameters in the model being solved for.
Your bottleneck may be in stage 1, the E-step: computing the residuals or distances between the data points and the models. In this stage the computations are all independent, so I would consider the first stage embarrassingly parallel and quite amenable to parallel computation using a parallel map/reduce or some of the other tools in Python. I have had good success using IPython for such tasks, but there are other good Python packages as well.
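As a sketch of parallelizing the E-step (the distance function, data shapes, and model "centers" here are placeholders for whatever the real models compute), each worker handles its own slice of the data points with no synchronization:

    import numpy as np
    from multiprocessing import Pool

    def residuals_for_chunk(chunk, models):
        """E-step for one chunk: distance from each point to each model.
        Chunks are independent, so no synchronization is needed."""
        # (m, 1, d) - (1, N, d) broadcasts to (m, N, d)
        diff = chunk[:, None, :] - models[None, :, :]
        return np.linalg.norm(diff, axis=2)        # (m, N)

    if __name__ == "__main__":
        # Placeholder data: M data points, N model "centers" standing
        # in for whatever parameters the real models carry.
        data_points = np.random.rand(100_000, 3)   # M x d
        models = np.random.rand(8, 3)              # N x d

        chunks = np.array_split(data_points, 8)
        with Pool(8) as pool:
            parts = pool.starmap(residuals_for_chunk,
                                 [(c, models) for c in chunks])
        residuals = np.vstack(parts)               # M x N distance matrix
        print(residuals.shape)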