I parallelized a function that (i) generates a list of sublists, (ii) parameterizes a matrix with each sublist, and (iii) returns the eigenvalues of the matrix. I parallelized steps (ii) and (iii) so that the eigenvalue calculation for each sublist runs in parallel. I then run this function for many iterations, and when I monitor the execution time per iteration I observe that every call takes slightly longer than the previous one (iter 1: 0.705 min, iter 2: 0.711 min, iter 3: 0.717 min, and so on). The pseudocode is given below:
import multiprocessing as mp
import time

def parallelized_func():
    global calc_eig
    pool = mp.Pool()

    def calc_eig(sublist):
        max_eig = other_module.parameterize_and_decompose(sublist)
        return max_eig

    list_of_sublists = some_generator.generate(n_sublists=100)
    max_eig = pool.map(calc_eig, list_of_sublists)
    pool.close()
    return max(max_eig)

for iter in range(10000):
    start = time.time()
    this_max_eig = parallelized_func()
    end = time.time()
    print(end - start)
I am not sure why the executions take longer with each iteration. I am monitoring memory and core utilization through htop and nothing looks abnormal. I also checked the other modules I am using inside the function, and they do not show increasing execution times, so I am guessing the problem is the way I am using Pool. Any feedback would be appreciated.
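For reference, a variant that creates the pool only once and reuses it across iterations would look roughly like this (a sketch, with calc_eig moved to module level so it can be pickled; other_module and some_generator are the same placeholders as in the pseudocode above):

import multiprocessing as mp
import time

def calc_eig(sublist):
    # module-level so that worker processes can pickle it
    return other_module.parameterize_and_decompose(sublist)

def parallelized_func(pool):
    list_of_sublists = some_generator.generate(n_sublists=100)
    max_eig = pool.map(calc_eig, list_of_sublists)
    return max(max_eig)

if __name__ == '__main__':
    with mp.Pool() as pool:  # workers created once, reused for every iteration
        for i in range(10000):
            start = time.time()
            this_max_eig = parallelized_func(pool)
            print(time.time() - start)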
The code below takes around 15 seconds to produce its result, but when I run it sequentially it only takes around 11 seconds. What can be the reason for this?
import multiprocessing
import os
import time

def square(x):
    # print(os.getpid())
    return x*x

if __name__ == '__main__':
    start_time = time.time()
    p = multiprocessing.Pool()
    r = range(100000000)
    p1 = p.map(square, r)
    end_time = time.time()
    print('time_taken::', end_time - start_time)
Sequential code
start_time = time.time()
d = list(map(square, range(100000000)))
end_time = time.time()
print('time_taken::', end_time - start_time)
Regarding your code example, there are two important factors that influence the runtime gains achievable through parallelization:
First, you have to take the administrative overhead into account: spawning new processes is rather expensive compared to simple arithmetic operations. You therefore only gain performance when the computation's complexity exceeds a certain threshold, which was not the case in your example above.
Secondly, you have to think of a "clever" way of splitting your computation into parts that can be executed independently. In the given code example, you can optimize the chunks you pass to the worker processes created by multiprocessing.Pool, so that each process has a self-contained package of computations to perform.
For example, this can be accomplished with the following modifications of your code:
import math
from multiprocessing import Pool

def square(x):
    return x ** 2

def square_chunk(i, j):
    return list(map(square, range(i, j)))

def calculate_in_parallel(n, c=4):
    """Calculates a list of squares in a parallelized manner"""
    result = []
    step = math.ceil(n / c)
    with Pool(c) as p:
        partial_results = p.starmap(
            square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )
    for res in partial_results:
        result += res
    return result
Note that I used the operation x ** 2 (instead of the heavily optimized x*x) to increase the load and underline the resulting runtime differences.
Here, the Pool's starmap() function is used, which unpacks the arguments of the passed tuples; this lets us effectively pass more than one argument to the mapped function. Furthermore, the workload is distributed evenly across the available cores: each worker computes the squares of the range (i, min(i + step, n)), where step denotes the chunk size, calculated as the maximum number n divided (rounded up) by the number of worker processes c.
By running the code with different parametrizations, one can clearly see that the performance gain increases as the maximum number n increases. As expected, the runtime is also reduced when more cores are used in parallel.
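For instance, a quick driver to try different parametrizations could look like this (the values of n and c here are chosen arbitrarily for illustration):

import time

if __name__ == '__main__':
    for n in (10**5, 10**6, 10**7):
        for c in (1, 2, 4):
            start = time.time()
            calculate_in_parallel(n, c)
            print('n={}, c={}: {:.3f}s'.format(n, c, time.time() - start))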
Edit:
As @KellyBundy pointed out, parallelism especially shines when you minimize not only the input to the worker processes but also the output. Several measurements in which each worker computes the sum of the squared numbers (sum(map(square, range(i, j)))) instead of returning (and concatenating) lists showed an even larger performance gain.
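A minimal sketch of that variant, reusing math, Pool, and square from the example above (only the per-chunk function and the final reduction change):

def sum_square_chunk(i, j):
    # each worker returns a single number instead of a whole list
    return sum(map(square, range(i, j)))

def sum_of_squares_in_parallel(n, c=4):
    """Calculates the sum of the squares below n in a parallelized manner"""
    step = math.ceil(n / c)
    with Pool(c) as p:
        partial_sums = p.starmap(
            sum_square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )
    return sum(partial_sums)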
I wrote an example that works on a list, with and without parallelization via Numba, as below:
Parallel
from numba import njit, prange

@njit(parallel=True)
def evaluate():
    n = 1000000
    a = [0] * n
    sum = 0
    for i in prange(n):
        a[i] = i * i
    for i in prange(n):
        sum += a[i]
    return sum
No parallel
def evaluate2():
    n = 1000000
    a = [0] * n
    sum = 0
    for i in range(n):
        a[i] = i * i
    for i in range(n):
        sum += a[i]
    return sum
and compare the evaluation times (t here appears to be a timer object such as pytictoc's TicToc, judging by the output format):
t.tic()
print(evaluate())
t.toc()
result: 333332833333500000
Elapsed time is 0.233338 seconds.
t.tic()
print(evaluate2())
t.toc()
result: 333332833333500000
Elapsed time is 0.195136 seconds.
The full code can be obtained from Colab.
The answer is that the number of operations is still too small. When I changed n to 100,000,000, the performance changed significantly.
I haven't tried it in Numba yet, but exactly this happens in Matlab and other programming languages when the CPU is used for non-parallel processing and the GPU for parallel processing. When small amounts of data are processed, the CPU beats the GPU in processing speed and parallel computing is not useful. Parallel processing is efficient only when the size of the processed data exceeds a certain value. There are benchmarks that show when parallel processing becomes efficient, and I have read papers in which the authors put switches in their code to choose between CPU and GPU during processing. Try the same code with a large array and compare the results.
I want to parallelise the operation of a function on each element of a list using ray. A simplified snippet is below:
import numpy as np
import time
import ray
import psutil

num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus)

@ray.remote
def f(a, b, c):
    return a * b - c

def g(a, b, c):
    return a * b - c

def my_func_par(large_list):
    # arguments a and b are constant, just to illustrate
    # argument c is each element of large_list
    [f.remote(1.5, 2, i) for i in large_list]

def my_func_seq(large_list):
    # arguments a and b are constant, just to illustrate
    # argument c is each element of large_list
    [g(1.5, 2, i) for i in large_list]

my_list = np.arange(1, 10000)

s = time.time()
my_func_par(my_list)
print(time.time() - s)
>>> 2.007

s = time.time()
my_func_seq(my_list)
print(time.time() - s)
>>> 0.0372
The problem is that when I time my_func_par, it is much slower (~54x, as can be seen above) than my_func_seq. One of the authors of ray answers a comment on this blog that seems to explain that what I am doing here is setting up len(large_list) different tasks, which is the wrong approach.
How do I use ray and modify the code above to run it in parallel (maybe by splitting large_list into chunks, with the number of chunks equal to the number of CPUs)?
EDIT: There are two important criteria in this question:
The function f needs to accept multiple arguments.
It may be necessary to use ray.put(large_list) so that the large_list variable can be stored in shared memory rather than copied to each worker process.
To add to what Sang said above:
Ray's distributed multiprocessing.Pool supports a fixed-size pool of Ray actors for easier parallelization.
import numpy as np
import time
import ray
from ray.util.multiprocessing import Pool

pool = Pool()

def f(x):
    # time.sleep(1)
    return 1.5 * 2 - x

def my_func_par(large_list):
    pool.map(f, large_list)

def my_func_seq(large_list):
    [f(i) for i in large_list]

my_list = np.arange(1, 10000)

s = time.time()
my_func_par(my_list)
print('Parallel time: ' + str(time.time() - s))

s = time.time()
my_func_seq(my_list)
print('Sequential time: ' + str(time.time() - s))
With the above code, my_func_par runs much faster (about 0.1 s). If you play with the code and make f(x) slower with something like time.sleep, you can see the clear advantage of multiprocessing.
The reason why the parallelized version is slower is that running ray tasks unavoidably incurs overhead (although ray puts a lot of effort into optimizing it): running things in parallel requires inter-process communication, serialization, and the like.
That being said, if your function is really fast (so fast that running it takes less time than the other overhead of distributed computation), you won't benefit from parallelization. Your code is exactly this case, because the function f is really tiny; I assume it takes less than a microsecond to run.
This means you should make the function f computationally heavier in order to benefit from parallelization. Note that your proposed solution might not help: even after chunking, each task might still be lightweight, depending on your list size.
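To illustrate the chunking idea from the question, here is a rough sketch (the helper names f_chunk and my_func_par_chunked are made up for this example): each task processes a whole slice, the constants a and b are passed as extra arguments, and large_list is stored once in shared memory with ray.put.

import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def f_chunk(a, b, arr, lo, hi):
    # process one contiguous slice of the shared array
    return [a * b - c for c in arr[lo:hi]]

def my_func_par_chunked(large_list, num_chunks=4):
    arr_ref = ray.put(large_list)  # stored once, not copied per task
    bounds = np.linspace(0, len(large_list), num_chunks + 1, dtype=int)
    refs = [f_chunk.remote(1.5, 2, arr_ref, lo, hi)
            for lo, hi in zip(bounds[:-1], bounds[1:])]
    return [x for chunk in ray.get(refs) for x in chunk]

Even then, whether this beats the sequential loop depends on how heavy each chunk is; for an operation as cheap as a * b - c it may still not pay off.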
When using the map function from the multiprocessing library, I see no difference in execution time when using more than 2 processes. I'm running the program on 4 cores.
The actual code is pretty straightforward and calculates the first 4000 Fibonacci numbers 4 times (= the number of cores). It distributes the work evenly between N cores (e.g., when using a Pool with 2 processes, each process will calculate the first 4000 Fibonacci numbers twice). This whole process is done for N = 1 up to the number of cores.
The output, with the number of cores and the corresponding execution time in seconds on each row, is:
1 3.147
2 1.72
3 1.896
4 1.899
Does anyone know why there is no decrease in execution time beyond 2 cores? The actual code is:
import multiprocessing
from time import time

def F(_):
    for n in range(4 * 10 ** 3):
        a, b = 0, 1
        for i in range(0, n):
            a, b = b, a + b
    return

def pool_fib():
    n_cores = multiprocessing.cpu_count()
    args = list(range(multiprocessing.cpu_count()))
    for i in range(1, n_cores + 1):
        with multiprocessing.Pool(i) as p:
            start = time()
            p.map(F, args)
            print(i, time() - start)

if __name__ == '__main__':
    pool_fib()
Provided you are using a fairly modern CPU, multiprocessing.cpu_count() won't give you the number of physical cores your machine has, but the number of hyper-threads. In a nutshell, hyper-threading allows a single physical core to have n (most commonly two) pipelines, which fools your OS into thinking you have n times the number of cores you really have. This helps when you are doing work that might starve a core of data (most notably IO, or RAM lookups caused by cache misses), but your workload is purely arithmetic and unlikely to starve the CPU, so you get little to no gain from hyper-threading, and the little gain you might get is overshadowed by multiprocessing overhead, which is quite significant.
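A minimal sketch of sizing the pool by physical cores instead, assuming the third-party psutil package (used in one of the snippets above) is available:

import multiprocessing
import psutil

physical_cores = psutil.cpu_count(logical=False)  # physical cores only
logical_cores = psutil.cpu_count(logical=True)    # usually equals multiprocessing.cpu_count()
print(physical_cores, logical_cores)

with multiprocessing.Pool(physical_cores) as p:
    pass  # submit CPU-bound work here, e.g. p.map(F, args)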
P.S.
I usually post this sort of thing as a comment, but I've exceeded the comment size limitation. By the way, if you've chosen the Fibonacci series for something more than just an example, you might want to consider a faster algorithm: Fast Fibonacci computation
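For reference, a sketch of one such algorithm, the fast-doubling method, which needs only O(log n) multiplications via the identities F(2k) = F(k)*(2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2:

def fib(n):
    """Return the n-th Fibonacci number via fast doubling."""
    def doubling(k):
        # returns the pair (F(k), F(k+1))
        if k == 0:
            return (0, 1)
        a, b = doubling(k // 2)
        c = a * (2 * b - a)  # F(2m), where m = k // 2
        d = a * a + b * b    # F(2m+1)
        return (c, d) if k % 2 == 0 else (d, c + d)
    return doubling(n)[0]

print(fib(10))  # 55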
I want to calculate the runtimes of two different algorithms in the same program. When I wrote a program that timed each individually, I obtained very different results, so to test this new program, I had Python calculate the runtime of the same algorithm twice. When I did this (in the program below), I found that the runtimes of the same algorithm were in fact different! What am I missing, and how do I fix this so I can compare algorithms?
import timeit

def calc1(x):
    return x*x+x+1

def calc2(x):
    return x*x+x+1

def main():
    x = int(input("Input a number to be tested: "))
    start1 = timeit.default_timer()
    result1 = calc1(x)
    end1 = timeit.default_timer()
    start2 = timeit.default_timer()
    result2 = calc2(x)
    end2 = timeit.default_timer()
    print("Result of calculation 1 was {0}; time to compute was {1} seconds.".format(result1, end1 - start1))
    print("Result of calculation 2 was {0}; time to compute was {1} seconds.".format(result2, end2 - start2))

main()
I think you're being bitten by Windows power management on one front; on the second, your testing method is flawed.
In the first case, I too got bitten by the fact that Windows, by default, throttles CPU throughput in order to save power. Currently there is a dramatic difference in your calculated runtimes; it can be reduced considerably just by placing a nonsense calculation like something = 5**1000000 immediately after x = int(input("Input a number to be tested: ")) to ramp up the CPU. I hate Windows 10, so I don't know off the top of my head how to change this setting, but switching "Power Options" to "High Performance" to remove CPU throttling should close the gap considerably.
The second issue is that you only run one test cycle; you cannot get stable results from a single run. Instead, you need multiple iterations. For example, with a million iterations you would see some similarity between the numbers:
import timeit

# the timed statement now actually calls the function;
# x is fixed in the setup so the timing is reproducible
exec1 = timeit.timeit(stmt='calc1(x)',
                      setup='def calc1(x):\n    return x*x+x+1\nx = 42',
                      number=1000000)
exec2 = timeit.timeit(stmt='calc2(x)',
                      setup='def calc2(x):\n    return x*x+x+1\nx = 42',
                      number=1000000)
print("Execution of first loop: {}".format(exec1))
print("Execution of second loop: {}".format(exec2))
Depending on your IDE (e.g., Canopy or Spyder), there may be cleaner ways of running timeit, such as reusing your existing definitions of calc1 and calc2.
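For instance, a sketch using timeit's globals parameter (available since Python 3.5) to reuse the definitions already present in the script:

import timeit

def calc1(x):
    return x*x+x+1

x = 42  # example input; in the original program this comes from input()

# reuse the module's own namespace instead of a setup string
elapsed = timeit.timeit('calc1(x)', globals=globals(), number=1000000)
print("Execution of calc1: {}".format(elapsed))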