Non-monotonic evolution of runtime with increasing parallelization - python

I'm running some runtime tests to understand what I can gain from parallelization and how runtime scales with it (linearly?).
For a given integer n I successively compute the Fibonacci numbers f(i) for i in {0, 1, ..., n}, varying the degree of parallelization from 1 up to 16 parallel processes.
import pandas as pd
import time
import multiprocessing as mp

# n-th Fibonacci number
def f(n: int):
    if n in {0, 1}:
        return n
    return f(n - 1) + f(n - 2)

if __name__ == "__main__":
    K = range(1, 16 + 1)
    n = 100
    N = range(n)
    df_dauern = pd.DataFrame(index=K, columns=N)
    for _n in N:
        _N = range(_n)
        print(f'\nn = {_n}')
        for k in K:
            start = time.time()
            pool = mp.Pool(k)
            pool.map(f, _N)
            pool.close()
            pool.join()
            ende = time.time()
            dauer = ende - start
            m, s = divmod(dauer, 60)
            h, m = divmod(m, 60)
            h, m, s = round(h), round(m), round(s)
            df_dauern.loc[k, _n] = f'{h}:{m}:{s}'
            print(f'... k = {k:02d}, duration: {h}:{m}:{s}')
    df_dauern.to_excel('Dauern.xlsx')
In the following DataFrame I display the duration (h:m:s) for n in {45, 46, 47}.
k    45       46       47
1    0:9:40   0:15:24  0:24:54
2    0:7:24   0:13:23  0:22:59
3    0:5:3    0:9:37   0:19:7
4    0:7:18   0:7:19   0:15:29
5    0:7:21   0:7:17   0:15:35
6    0:3:41   0:9:34   0:9:36
7    0:3:40   0:9:46   0:9:34
8    0:3:41   0:9:33   0:9:33
9    0:3:39   0:9:33   0:9:33
10   0:3:39   0:9:32   0:9:32
11   0:3:39   0:9:34   0:9:45
12   0:3:40   0:6:4    0:9:37
13   0:3:39   0:5:54   0:9:32
14   0:3:39   0:5:55   0:9:32
15   0:3:40   0:5:53   0:9:33
16   0:3:39   0:5:55   0:9:33
In my opinion the results are odd in two respects. First, the duration does not decrease monotonically with increasing parallelization, and second, runtime does not decrease linearly (that is, doubling the processes does not halve the runtime).
Is this behavior to be expected?
Is this behavior due to the chosen example of computing Fibonacci numbers?
How is it even possible that runtime increases with increasing parallelization (e.g. always when moving from 2 to 3 parallel processes)?
How come it does not make a difference whether I use 6 or 16 parallel processes?

It's because of multiprocessing's chunking of the work and the fact that the task has exponential complexity (the naive recursive Fibonacci takes time proportional to φ^n, so the last few inputs dominate the total work). By default the pool chooses a chunksize relative to the number of workers.
Basically, multiprocessing splits the work into equal chunks to reduce serialization overhead; the chunksize is given by:
chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
chunksize += bool(extra)
For 4 and 5 workers the chunk size is the same (3), and 99.9% of the time is taken by the last 3 tasks, which are scheduled on the same core because they sit in one chunk. So one core ends up doing 99.9% of the work regardless of the core count; the extra 3 seconds are most likely scheduling overhead (more workers = more scheduling). You'll get a speedup if you pass chunksize=1 to pool.map manually, as each of those 3 tasks will then be scheduled on a different core.
For 6 or more workers the chunksize is calculated to be 2, but you have an odd number of tasks, which means you always wait for the last chunk that is scheduled, and it contains the longest task. The entire 3:40 minutes are spent inside a single function call that cannot be broken down further, so it doesn't matter whether you launch 6 workers or 100; you are still limited by the slowest task (or, really, the slowest chunk).
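As a sketch (reusing f from the question; the pool size 6 and task count 47 are just examples), overriding the default chunking looks like this:
# Force one task per chunk so that the few expensive Fibonacci calls at the
# end of the range each land on a different worker (f as defined in the
# question, at module top level so worker processes can import it).
import multiprocessing as mp

if __name__ == '__main__':
    with mp.Pool(6) as pool:
        results = pool.map(f, range(47), chunksize=1)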

Related

Why does python multiprocessing run sequentially?

I'm trying to speed up the dot product of two large matrices, so I tested a small example with multiprocessing. The code is as follows. But from the results, I found that my code runs as if it were sequential.
Code
import multiprocessing as mp
import numpy as np
import time

def dot(i):
    print(f"Process {i} enters")
    np.random.seed(10)
    a = np.random.normal(0, 1, (5000, 5000))
    b = np.random.normal(0, 1, (5000, 5000))
    print(f"Process {i} starts calculating")
    res = np.dot(a, b)
    print(f"Process {i} finishes")
    return res

if __name__ == '__main__':
    start = time.perf_counter()
    dot(1)
    print(time.perf_counter() - start)
    print('=============================')
    print(mp.cpu_count())
    i = 8
    start = time.perf_counter()
    pool = mp.Pool(mp.cpu_count())
    res = []
    for j in range(i):
        res.append(pool.apply_async(dot, args=(j,)))
    pool.close()
    pool.join()
    end = time.perf_counter()
    # res = [r.get() for r in res]
    # print(res)
    print(end - start)
Results
Process 1 enters
Process 1 starts calculating
Process 1 finishes
2.582571708
=============================
8
Process 0 enters
Process 1 enters
Process 2 enters
Process 3 enters
Process 4 enters
Process 5 enters
Process 6 enters
Process 7 enters
Process 4 starts calculating
Process 7 starts calculating
Process 5 starts calculating
Process 3 starts calculating
Process 1 starts calculating
Process 6 starts calculating
Process 0 starts calculating
Process 2 starts calculating
Process 4 finishes
Process 7 finishes
Process 1 finishes
Process 0 finishes
Process 6 finishes
Process 2 finishes
Process 5 finishes
Process 3 finishes
27.05124225
The output shows that the code does seem to run in parallel (judging by the printed text), but the total running time looks sequential. I don't know why, so I hope someone can give me some advice. Thanks in advance.
Of course, there is always additional overhead involved in creating processes and in passing arguments and results between address spaces (and in this case your results are extremely large).
My best guess is that the performance problem arises because the storage requirements of running 8 processes in parallel (I assume you have at least 8 logical processors, preferably 8 physical ones) on such large arrays cause extreme paging (I get the same results as you). I have therefore modified the demo to be less memory intensive while keeping the CPU requirements high, by performing the dot function many times in a loop. I have also reduced the number of processes to 4, which is the number of physical processors on my desktop; this gives each process a better chance of running in parallel:
from multiprocessing.pool import Pool
import numpy as np
import time

def dot(i):
    print(f"Process {i} enters")
    np.random.seed(10)
    a = np.random.normal(0, 1, (50, 50))
    b = np.random.normal(0, 1, (50, 50))
    print(f"Process {i} starts calculating")
    for _ in range(500_000):
        res = np.dot(a, b)
    print(f"Process {i} finishes")
    return res

if __name__ == '__main__':
    start = time.perf_counter()
    dot(1)
    print(time.perf_counter() - start)
    print('=============================')
    i = 4
    start = time.perf_counter()
    pool = Pool(i)
    res = []
    for j in range(i):
        res.append(pool.apply_async(dot, args=(j,)))
    pool.close()
    pool.join()
    end = time.perf_counter()
    # res = [r.get() for r in res]
    # print(res)
    print(end - start)
Results:
Process 1 enters
Process 1 starts calculating
Process 1 finishes
6.0469717
=============================
Process 0 enters
Process 0 starts calculating
Process 1 enters
Process 1 starts calculating
Process 2 enters
Process 3 enters
Process 2 starts calculating
Process 3 starts calculating
Process 0 finishes
Process 1 finishes
Process 3 finishes
Process 2 finishes
8.8419177
This is much closer to what you would expect. When I change i to 8, the number of logical processors, the running times are 6.1023760000000005 and 12.749368100000002 respectively.
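For a rough sense of why paging is plausible (my own back-of-the-envelope arithmetic, not from the original post): each 5000 x 5000 float64 array occupies 200 MB, each task holds three of them (a, b and res), and the original demo runs 8 such tasks at once:
# Rough peak-memory estimate for the original demo (assumptions: float64
# arrays, 8 concurrent workers, 3 arrays alive per task; ignores the
# pickled results queued back to the parent process).
bytes_per_array = 5000 * 5000 * 8   # 200_000_000 bytes = 200 MB
peak = bytes_per_array * 3 * 8      # a, b, res across 8 workers
print(f"{peak / 1e9:.1f} GB")       # -> 4.8 GB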

Increase number of CPUs (ncores) has negative impact on multiprocessing pool

I have the following code and I want to spread the task across multiple processes. After experimenting, I realized that increasing the number of CPU cores negatively impacts the execution time.
I have 8 cores on my machine.
Case 1: without using multiprocessing
Execution time: 106 minutes
Case 2: with multiprocessing using ncores = 4
Execution time: 37 minutes
Case 3: with multiprocessing using ncores = 7
Execution time: 40 minutes
Here is the code:
import time
import functools
import multiprocessing as mp

def _fun(i, args1=10):
    # Sort matrix W
    # For loop 1 on matrix M
    # For loop 2 on matrix Y
    return value

def run1(ncores=mp.cpu_count()):
    ncores = ncores - 4  # use 4 and 1 to have ncores = 4 and 7
    _f = functools.partial(_fun, args1=x)
    with mp.Pool(ncores) as pool:
        result = pool.map(_f, range(n))
    return [t for t in result]

start = time.time()
list1 = run1()
end = time.time()
print('time {0} minutes '.format((end - start) / 60))
My question: what is the best practice for using multiprocessing? My understanding was that the more CPU cores we use, the faster it runs.

A while loop time complexity

I'm interested in determining the big O time complexity of the following:
def f(x):
    r = x / 2
    d = 1e-10
    while abs(x - r**2) > d:
        r = (r + x/r) / 2
    return r
I believe this is O(log n). To arrive at this, I merely collected empirical data via the timeit module and plotted the results, and the plot looked logarithmic. I used the following code:
import timeit
import numpy as np
import matplotlib.pyplot as plt

ns = np.linspace(1, 50_000, 100, dtype=int)
ts = [timeit.timeit('f({})'.format(n),
                    number=100,
                    globals=globals())
      for n in ns]
plt.plot(ns, ts, 'or')
But this seems like a corny way to go about figuring this out. Intuitively, I understand that the body of the while loop involves dividing an expression by 2 some number k times until the while expression is equal to d. This repeated division by 2 gives something like 1/2^k, from which I can see where a log is involved to solve for k. I can't seem to write down a more explicit derivation, though. Any help?
This is Heron's (Or Babylonian) method for calculating the square root of a number. https://en.wikipedia.org/wiki/Methods_of_computing_square_roots
Big O notation for this requires a numerical analysis approach. For more details on the analysis you can check the wikipedia page listed or look for Heron's error convergence or fixed point iteration. (or look here https://mathcirclesofchicago.org/wp-content/uploads/2015/08/johnson.pdf)
Broad strokes: write the error as e_n = r_n - sqrt(x). Plugging the iteration r_{n+1} = (r_n + x/r_n)/2 into that definition gives e_{n+1} = (r_n - sqrt(x))**2 / (2*r_n) = e_n**2 / (2*r_n).
Since r_n >= sqrt(x) after the first step and e_n <= r_n, we get e_{n+1} <= min{e_n**2 / (2*sqrt(x)), e_n / 2}, so the error decreases quadratically, with the number of accurate digits effectively doubling each iteration.
What's different between this analysis and Big-O is that the running time does NOT depend on the size of the input, but on the desired accuracy. So in terms of the input, this while loop is O(1), because its number of iterations is bounded by the accuracy, not the input.
In terms of accuracy, the error after n iterations is bounded above (up to the initial error) by e_n < 2**(-n), so we need n such that 2**(-n) < d, i.e. n > -log_2(d). Assuming d < 1, n = ceil(-log_2(d)) iterations suffice, so in terms of d the loop is O(log(1/d)).
EDIT: Some more info on error analysis of fixed point iteration http://www.maths.lth.se/na/courses/FMN050/media/material/part3_1.pdf
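A quick way to see the quadratic convergence empirically (my own sketch, not part of the original answer) is to print the error r - sqrt(x) after each iteration; the exponent of the error roughly doubles once the iterate is close:
import math

def heron_trace(x, d=1e-10):
    # Heron's iteration as in the question, printing the error at each step.
    r = x / 2
    step = 0
    while abs(x - r**2) > d:
        r = (r + x / r) / 2
        step += 1
        print(step, r - math.sqrt(x))
    return r

heron_trace(2.0)  # errors shrink roughly as 1e-1, 1e-3, 1e-6, 1e-12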
I believe you're correct that it's O(log n).
Here you can see the successive values of r when x = 100000:
1 50000
2 25001
3 12502
4 6255
5 3136
6 1584
7 823
8 472
9 342
10 317
11 316
12 316
(I've rounded them off because the fractions are not interesting).
What you can see is that it goes through two phases.
Phase 1 is when r is large. During these first few iterations, x/r is tiny compared to r. As a result, r + x/r is close to r, so (r + x/r) / 2 is approximately r/2. You can see this in the first 8 iterations.
Phase 2 is when it gets close to the final result. During the last few iterations, x/r is close to r, so r + x/r is close to 2 * r, so (r + x/r) / 2 is close to r. At this point we're just improving the approximation by small amounts. These iterations are not really very dependent on the magnitude of x.
Here's the succession for x = 1000000 (10x the above):
1 500000
2 250001
3 125002
4 62505
5 31261
6 15646
7 7855
8 3991
9 2121
10 1296
11 1034
12 1001
13 1000
14 1000
This time there are 10 iterations in Phase 1, then we again have 4 iterations in Phase 2.
The complexity of the algorithm is dominated by Phase 1, which is logarithmic because it's approximately dividing by 2 each time.
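A minimal sketch (mine; the helper name f_trace is made up) that reproduces these traces by printing the rounded value of r at every step. The exact number of trailing rows may differ slightly, since the rounded values repeat near convergence:
def f_trace(x):
    # Same iteration as f() above, printing each successive (rounded) r.
    r = x / 2
    d = 1e-10
    i = 1
    print(i, round(r))  # the initial guess is row 1 in the tables
    while abs(x - r**2) > d:
        r = (r + x / r) / 2
        i += 1
        print(i, round(r))
    return r

f_trace(100000)   # first table above
f_trace(1000000)  # second table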

How to improve python threading by using locks?

Hi, I want to calculate the summation of the integers from 1 to N using threading, to speed it up. Thus I wrote:
import threading

N = int(input())
sum = 0
i = 1
lock = threading.Lock()

def thread_worker():
    global sum
    global i
    lock.acquire()
    sum += i
    i += 1
    lock.release()

for j in range(N):
    w = threading.Thread(target=thread_worker)
    w.start()
Because I don't want to mess up my variable i, I used a lock. However, the speed of the code is not improved, since only one thread can access the shared variables at a time.
So I want to know is there anyway I can improve the threading runtime? Will adding more variables and locks be helpful?
Thanks!
I feel like your question deserves two answers:
One for how to use threading the right way.
One for how to speed up the summation of the integers between 1 and N.
How to use threading
There is always some overhead to threading: creating threads, locking, etc. Even if this overhead is really small, some overhead will remain. This means that threading is only worth it if each thread has some significant work to do. Your function does almost nothing inside the lock.acquire()/release() pair. Threading would work if you were doing some more complicated work there. But you are using the lock correctly -- this is just not a good problem for threading.
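To illustrate what "significant work per thread" looks like (my own sketch, not part of the original answer): give each thread a large sub-range to sum locally and take the lock only once per thread to merge its partial result. Because of CPython's GIL this still won't beat a plain sequential loop for pure-Python arithmetic, but it is the shape that threading-friendly workloads take:
import threading

def threaded_sum(N: int, num_threads: int = 4) -> int:
    total = 0
    lock = threading.Lock()

    def worker(lo: int, hi: int) -> None:
        nonlocal total
        partial = sum(range(lo, hi))  # big chunk of work, no lock held
        with lock:                    # lock taken once per thread
            total += partial

    step = N // num_threads + 1
    threads = [threading.Thread(target=worker, args=(lo, min(lo + step, N + 1)))
               for lo in range(1, N + 1, step)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

assert threaded_sum(100) == 5050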
Fast Summation of 1..N
Measure, Measure, Measure
I'm running Python 3.7, and have pytest-benchmark 3.2.2 installed. Pytest-benchmark runs a function for you an appropriate number of times, and reports statistics on the runtime.
import threading
import pytest

def compute_variant1(N: int) -> int:
    global sum_
    global i
    sum_ = 0
    i = 1
    lock = threading.Lock()

    def thread_worker():
        global sum_
        global i
        lock.acquire()
        sum_ += i
        i += 1
        lock.release()

    threads = []
    for j in range(N):
        threads.append(threading.Thread(target=thread_worker))
        threads[-1].start()
    for t in threads:
        t.join()
    return sum_

@pytest.mark.parametrize("func", [compute_variant1])
@pytest.mark.parametrize("N,expected", [(10, 55), (30, 465), (100, 5050)])
def test_var1(benchmark, func, N, expected):
    result = benchmark(func, N=N)
    assert result == expected
Run this with py.test --benchmark-histogram=bench and open the generated bench.svg. It also prints this table:
----------------------------------------------------------------------------------------------------- benchmark: 3 tests -----------------------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_var1[10-55-compute_variant1] 570.0710 (1.0) 4,900.3160 (1.51) 658.8392 (1.0) 270.6056 (1.36) 602.5220 (1.0) 42.6373 (1.0) 19;82 1,517.8209 (1.0) 529 1
test_var1[30-465-compute_variant1] 1,701.2970 (2.98) 3,237.5830 (1.0) 1,879.8490 (2.85) 198.3905 (1.0) 1,802.5160 (2.99) 146.4465 (3.43) 59;43 531.9576 (0.35) 432 1
test_var1[100-5050-compute_variant1] 5,809.7500 (10.19) 13,354.3520 (4.12) 6,698.4778 (10.17) 1,413.1428 (7.12) 6,235.4355 (10.35) 766.0440 (17.97) 6;7 149.2876 (0.10) 74 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simpler Algorithms are Faster
Often it is really helpful to compare a fast implementation against the most straightforward one.
The simplest implementation I can think of -- without using Python tricks -- is a simple loop:
def compute_variant2(N):
    sum_ = 0
    for i in range(1, N+1):
        sum_ += i
    return sum_
Python has a function called sum() which takes a list or any iterable like range(N):
def compute_variant3(N: int) -> int:
    return sum(range(1, N+1))
Here are the results (I've removed some columns):
------------------------------------------------------- benchmark: 3 tests ------------------------------------------------
Name (time in us) Mean StdDev Median Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------
test_var1[100-5050-compute_variant2] 4.2328 (1.0) 1.6411 (1.0) 4.1150 (1.0) 163106 1
test_var1[100-5050-compute_variant3] 4.7113 (1.11) 1.6773 (1.02) 4.5560 (1.11) 141744 1
test_var1[100-5050-compute_variant1] 6,404.1856 (>1000.0) 668.2502 (407.21) 6,257.6385 (>1000.0) 106 1
---------------------------------------------------------------------------------------------------------------------------
As you can see the variant1 based on threading is many, many times slower than the sequential implementations.
Even faster using Mathematics
Carl Friedrich Gauss, as a child, rediscovered a closed-form solution to our problem: N * (N + 1) / 2.
This can be expressed in Python:
def compute_variant4(N: int) -> int:
    return N * (N + 1) // 2
Let's compare that to the other fast implementations (I'll leave out the first one):
As you can see in the table below, the last variant is the fastest, and, importantly, its runtime is independent of N.
---------------------------------------------------------------------------------------------------- benchmark: 12 tests -----------------------------------------------------------------------------------------------------
Name (time in ns) Min Max Mean StdDev Median IQR Outliers OPS (Kops/s) Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
var1[10-55-compute_variant4] 162.0100 (1.0) 1,053.6800 (1.0) 170.1799 (1.0) 32.5862 (1.0) 163.7200 (1.0) 8.1800 (1.59) 1135;1225 5,876.1353 (1.0) 58310 100
var1[13-91-compute_variant4] 162.1400 (1.00) 1,354.3200 (1.29) 181.5037 (1.07) 46.2132 (1.42) 176.1800 (1.08) 5.1500 (1.0) 2214;14374 5,509.5296 (0.94) 61452 100
var1[30-465-compute_variant4] 188.8900 (1.17) 1,874.2800 (1.78) 200.4983 (1.18) 40.8533 (1.25) 191.0000 (1.17) 9.8300 (1.91) 1245;1342 4,987.5732 (0.85) 51919 100
var1[100-5050-compute_variant4] 192.9091 (1.19) 5,938.7273 (5.64) 209.0628 (1.23) 75.8845 (2.33) 198.5455 (1.21) 10.9091 (2.12) 1879;4696 4,783.2508 (0.81) 194515 22
var1[10-55-compute_variant2] 676.1000 (4.17) 18,987.8000 (18.02) 719.4231 (4.23) 194.5556 (5.97) 689.0000 (4.21) 34.9000 (6.78) 1447;2199 1,390.0027 (0.24) 125898 10
var1[13-91-compute_variant2] 753.9000 (4.65) 12,103.8000 (11.49) 799.2837 (4.70) 201.3654 (6.18) 766.5000 (4.68) 38.1000 (7.40) 1554;3049 1,251.1203 (0.21) 124441 10
var1[10-55-compute_variant3] 1,021.0000 (6.30) 77,718.8000 (73.76) 1,157.6125 (6.80) 544.1982 (16.70) 1,098.2000 (6.71) 73.0000 (14.17) 3802;12244 863.8469 (0.15) 186672 5
var1[13-91-compute_variant3] 1,127.6000 (6.96) 44,606.4000 (42.33) 1,279.9332 (7.52) 476.7700 (14.63) 1,200.2000 (7.33) 90.0000 (17.48) 4022;21018 781.2908 (0.13) 172147 5
var1[30-465-compute_variant2] 1,304.7500 (8.05) 48,218.5000 (45.76) 1,457.3923 (8.56) 550.7975 (16.90) 1,385.7500 (8.46) 80.2500 (15.58) 3944;22221 686.1570 (0.12) 177086 4
var1[30-465-compute_variant3] 1,738.6667 (10.73) 86,587.3333 (82.18) 1,935.1659 (11.37) 762.7118 (23.41) 1,860.3333 (11.36) 128.6667 (24.98) 2474;6870 516.7516 (0.09) 176773 3
var1[100-5050-compute_variant2] 3,891.0000 (24.02) 181,009.0000 (171.79) 4,218.4608 (24.79) 1,721.8413 (52.84) 3,998.0000 (24.42) 210.0000 (40.78) 1788;2670 237.0533 (0.04) 171881 1
var1[100-5050-compute_variant3] 4,200.0000 (25.92) 76,885.0000 (72.97) 4,516.5853 (26.54) 1,452.5713 (44.58) 4,343.0000 (26.53) 210.0000 (40.78) 1204;2311 221.4062 (0.04) 153587 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Algorithm for distributing tasks to two printers?

I was doing a programming exercise online, and I found this question:
Two printers work at different speeds. The first printer produces one paper in x minutes, while the second does it in y minutes. To print N papers in total, how should the tasks be distributed between the two printers so that the printing time is minimal?
The exercise gives me three inputs x,y,N and asks for the minimum time as output.
input data:
1 1 5
3 5 4
answer:
3 9
I tried setting the number of tasks for the first printer to a, and for the second printer to N-a. The most efficient way to print is to have both finish at the same time, so I thought the minimum time would be ((n*b)/(a+b))+1. However, this formula is wrong.
Then I tried a brute-force way to solve the problem. I first determined which of the two printers is faster. Then I kept adding papers one at a time to the faster printer. Whenever the time accumulated on that printer exceeded the time to print one paper on the other printer, I gave one paper to the slower printer and subtracted that paper's time from the faster printer's clock.
The code looks like this:
def fastest_time(a, b, n):
    """Return the smallest time when keeping the two machines working at
    the same time.

    The parameters a and b should each be a float/int giving the time per
    paper of the two machines. n should be an int giving the total number
    of tasks. Returns the minimal time needed."""
    # Assign the one-paper times by magnitude; the algorithm counts
    # along the faster printer.
    if a > b:
        slower_time_each = a
        faster_time_each = b
    elif a < b:
        slower_time_each = b
        faster_time_each = a
    # If a and b are the same, just run the formula for one printer
    else:
        return (a * n) / 2 + 1
    faster_paper = 0
    faster_time = 0
    slower_paper = 0
    # Loop until the produced papers cover the total task
    while faster_paper + slower_paper < n:
        # Keep adding one task to the faster printer
        faster_time += 1 * faster_time_each
        faster_paper += 1
        # If the time exceeds the time needed by the slower machine,
        # assign one task to it
        if faster_time >= slower_time_each:
            slower_paper += 1
            faster_time -= 1 * slower_time_each
    # Return the total time needed
    return faster_paper * faster_time_each
It works when N is small or x and y are big, but it takes a very long time (more than 10 minutes, I guess) when x and y are very small, e.g. for the input 1 2 159958878.
I believe there is a better algorithm to solve this problem; can anyone give me some suggestions or hints?
Given the input in the form
x, y, n = 1, 2, 159958878
this should work:
import math
math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y))
This works for all your sample inputs.
In [61]: x, y, n = 1,1,5
In [62]: math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y))
Out[62]: 3.0
In [63]: x, y, n = 3,5,4
In [64]: math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y))
Out[64]: 9.0
In [65]: x, y, n = 1,2,159958878
In [66]: math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y))
Out[66]: 106639252.0
EDIT:
This does not work for the case mentioned by @Antti, i.e. x, y, n = 4, 7, 2.
The reason is that we only considered giving the larger share of papers to the faster printer. The solution is to compute both values, i.e. loading the faster printer more and loading the slower printer more, and then choose whichever result is smaller.
So, this works for all the cases, including @Antti's:
min((math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y)),
     math.ceil((min((x,y)) / float(x+y)) * n) * max((x,y))))
Although there might be some extreme cases where you might have to change it a little bit.
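Wrapped up as a function (my sketch of the answer's formula; the name min_print_time is made up), checked against the question's sample cases and @Antti's case:
import math

def min_print_time(x: int, y: int, n: int) -> int:
    # Try giving the larger share of the n papers to each printer in turn
    # and keep the better of the two schedules.
    t1 = math.ceil((max(x, y) / (x + y)) * n) * min(x, y)
    t2 = math.ceil((min(x, y) / (x + y)) * n) * max(x, y)
    return min(t1, t2)

print(min_print_time(1, 1, 5))          # 3
print(min_print_time(3, 5, 4))          # 9
print(min_print_time(4, 7, 2))          # 7 (the @Antti case)
print(min_print_time(1, 2, 159958878))  # 106639252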
