Multiprocessing not using whole CPU

I'm testing Python's multiprocessing module. I'm trying to compute pi with a Monte Carlo technique on my 12-thread Ryzen 5 5600.
The problem is that my CPU is not fully used: only 47% is busy. My code is below. Changing the value of n_cpu barely changes the core usage, whereas increasing N by one order of magnitude raises the load to as much as 77%... but I believed that N shouldn't affect the number of processes.
Please help me understand how to parallelize my code correctly, thanks.
import random
import math
import numpy as np
import multiprocessing
from multiprocessing import Pool
def sample(n):
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle
N_test=1000
N=12*10**4
n_cpu = 12
pi=0
for j in range(N_test):
    part_count=[int(N/n_cpu)] * n_cpu
    pool = Pool(processes=n_cpu)
    results = pool.map(sample, part_count)
    pool.close()
    pi += sum(results)/(N*1.0)*4
print(pi/N_test)

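The CPU is underused because you create a brand-new process pool for every small batch of work instead of creating a single pool and sending all the work to it. Starting and tearing down the worker processes takes time during which no sampling is done.
Simply using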
pool = Pool(processes=n_cpu)
for j in range(N_test):
    part_count=[int(N/n_cpu)] * n_cpu
    results = pool.map(sample, part_count)
    pi += sum(results)/(N*1.0)*4
pool.close()
should already give some speedup.
To optimize this further:
We can change the way the jobs are split up so that each task handles more samples.
We can use NumPy's vectorized random functions, which run faster than calling random.random() in a Python loop.
Finally, for the last bit of speed, we can use numba with a thread pool to reduce overhead even more.
import time
import numpy as np
from multiprocessing.pool import ThreadPool
from numba import jit
@jit(nogil=True, parallel=True, fastmath=True)
def sample(n):
    x = np.random.random(n)
    y = np.random.random(n)
    inside_circle = np.square(x) + np.square(y) < 1.0
    return int(np.sum(inside_circle))
total_samples = int(3e9)
function_limit = int(1e7)
n_cpu = 12
pi=0
assert total_samples%function_limit == 0
start = time.perf_counter()
with ThreadPool(n_cpu) as pool:
    part_count=[function_limit] * (total_samples//function_limit)
    results = pool.map(sample, part_count)
pi = 4*sum(results)/(total_samples)
end = time.perf_counter()
print(pi)
print(round(end-start,3), "seconds taken")
resulting in
3.141589756
6.982 seconds taken
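If you would rather stay with plain processes and skip numba, the first two suggestions (one long-lived pool, fewer and larger tasks, NumPy sampling) can be sketched roughly as follows. This is only a sketch under assumptions: the names count_inside and estimate_pi, the seed, and the chunk sizes are made up for illustration, and giving each task its own generator via SeedSequence.spawn is my addition, not part of the answer above.

```python
import numpy as np
from multiprocessing import Pool


def count_inside(args):
    """Count the points of one chunk that fall inside the unit quarter circle."""
    seed_seq, n = args
    rng = np.random.default_rng(seed_seq)  # one independent stream per task (assumption)
    x = rng.random(n)
    y = rng.random(n)
    return int(np.count_nonzero(x * x + y * y < 1.0))


def estimate_pi(total_samples, chunk_size, n_workers, seed=12345):
    """Estimate pi using a single pool and a handful of large chunks."""
    n_chunks, remainder = divmod(total_samples, chunk_size)
    sizes = [chunk_size] * n_chunks + ([remainder] if remainder else [])
    seeds = np.random.SeedSequence(seed).spawn(len(sizes))
    # The pool is created once and reused for every chunk.
    with Pool(processes=n_workers) as pool:
        counts = pool.map(count_inside, list(zip(seeds, sizes)))
    return 4.0 * sum(counts) / total_samples


if __name__ == "__main__":
    print(estimate_pi(total_samples=2_000_000, chunk_size=250_000, n_workers=4))
```

The chunk size is a trade-off: larger chunks mean less per-task overhead but more memory per worker, so it is worth timing a few values on your own machine.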

Related

the first calculation with torch.einsum is much slower

When I run several calculations with torch.einsum in a row, the first one is always much slower than the following calculations.
The following code and plot illustrate the problem:
import numpy as np
import torch as tor
from timeit import default_timer as timer
N = 1000
L = 10
time_arr = np.zeros(L)
for i in range(L):
    a = tor.randn(N, N).to("cuda:0") # 3 random 1000x1000 matrices for each cycle
    b = tor.randn(N, N).to("cuda:0")
    c = tor.randn(N, N).to("cuda:0")
    time_start = timer()
    tor.einsum("ij, kj",tor.einsum("ij, ja", a, b), c)
    time_end = timer()
    time_arr[i] = time_end - time_start
Plot of the different times for each cycle of the loop

python multiprocessing performance decay very fast with core numbers

I've got a new server with 2 Intel Xeon Gold 6138 CPUs, each with 20 cores/40 threads, so 40 cores/80 threads in total.
I'm testing it with a very simple task: no I/O, just pure calculation. But the per-thread efficiency decays really fast.
import numpy as np
from datetime import datetime as dt
from multiprocessing import Pool
def trytrytryshare(i,times):
    for j in range(times):
        indata[0] * indata[1]
    return
def trymultishare(thread = 70 , times = 10):
    st = dt.now()
    args_l = [(i,times) for i in range(thread)]
    print(st)
    p = Pool(thread)
    for i in range(len(args_l)):
        p.apply_async(func = trytrytryshare, args = (args_l[i]))
    p.close()
    p.join()
    print('%d threads finished in %d secs' %(thread,(dt.now()-st).seconds))
    return
if __name__ == '__main__':
    global indata
    size = 10000
    x = np.random.rand(size,size)
    y = np.random.rand(size,size)
    indata = (x,y)
    for i in range(1,71,10):
        trymultishare(thread = i,times = 20)
One thread takes about 7 seconds, so I was expecting 80 threads to take 7 seconds or slightly more. But it takes 140 seconds (see the linked result screenshot), so the performance of each thread decayed by a whopping 95%!
Is this normal, or am I doing something wrong? I'm trying to understand why it decayed so much...
Thanks, guys!

How to implement multiprocessing in Monte Carlo integration

I created a Python program that integrates a given function over a given interval using Monte Carlo simulation. It works well, except for the fact that it runs painfully slow when you want higher levels of accuracy (larger N value). I figured I'd give multiprocessing a try in order to speed it up, but then I realized I have no clue how to implement it. Here's what I have right now:
from scipy import random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Process
import os
# GOAL: Approximate the integral of a function f(x) from lower bound a to upper bound b using Monte Carlo simulation
# bounds of integration
a = 0
b = np.pi
# function to integrate
def f(x):
    return np.sin(x)
N = 10000
areas = []
def mcIntegrate():
    for i in range(N):
        # array filled with random numbers between limits
        xrand = random.uniform(a, b, N)
        # sum the return values of the function of each random number
        integral = 0.0
        for i in range(N):
            integral += f(xrand[i])
        # scale integral by difference of bounds divided by amount of random values
        ans = integral * ((b - a) / float(N))
        # add approximation to list of other approximations
        areas.append(ans)
if __name__ == "__main__":
    processes = []
    numProcesses = os.cpu_count()
    for i in range(numProcesses):
        process = Process(target=mcIntegrate)
        processes.append(process)
    for process in processes:
        process.start()
    for process in processes:
        process.start()
    # graph approximation distribution
    plt.title("Distribution of Approximated Integrals")
    plt.hist(areas, bins=30, ec='black')
    plt.xlabel("Areas")
    plt.show()
Can I get some help with this implementation?
Update: I took the advice from the comments and used multiprocessing.Pool, and also cut down on some operations by using NumPy instead. The run time went from about 5 min to about 6 sec (for N = 10000). Here's my implementation:
import scipy
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import os
# GOAL: Approximate the integral of function f from lower bound a to upper bound b using Monte Carlo simulation
a = 0 # lower bound of integration
b = np.pi # upper bound of integration
f = np.sin # function to integrate
N = 10000 # sample size
def mcIntegrate(p):
    xrand = scipy.random.uniform(a, b, N) # create array filled with random numbers within bounds
    integral = np.sum(f(xrand)) # sum return values of function of each random number
    approx = integral * ((b - a) / float(N)) # scale integral by difference of bounds divided by sample size
    return approx
if __name__ == "__main__":
    # run simulation N times in parallel and store results in array
    with multiprocessing.Pool(os.cpu_count()) as pool:
        areas = pool.map(mcIntegrate, range(N))
    # graph approximation distribution
    plt.title("Distribution of Approximated Integrals")
    plt.hist(areas, bins=30, ec='black')
    plt.xlabel("Areas")
    plt.show()

This turned out to be a more interesting problem than I expected once I got to optimising it. The basic method is very simple:
from multiprocessing import Pool
def f(x):
    return x
if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(f, range(100))
Here is your mcIntegrate adapted for multiprocessing:
from tqdm import tqdm
def mcIntegrate(steps):
    tasks = []
    print("Setting up simulations")
    # linear
    for _ in tqdm(range(steps)):
        xrand = random.uniform(a, b, steps)
        for i in range(steps):
            tasks.append(xrand[i])
    pool = Pool(cpu_count())
    print("Simulating (no progress)")
    results = pool.map(f, tasks)
    pool.close()
    print("summing")
    areas = []
    for chunk in tqdm(range(steps)):
        vals = results[chunk * steps : (chunk + 1) * steps]
        integral = sum(vals)
        ans = integral * ((b - a) / float(steps))
        areas.append(ans)
    return areas
tqdm is just used to display a progress bar.
This is the basic workflow for multiprocessing: break the problem up into tasks, solve all the tasks, then combine the results again. And indeed the code as given works. (Note that I've renamed your N to steps.)
For completeness, the script now begins:
from scipy import random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool, cpu_count
from tqdm import tqdm
# function to integrate
def f(x):
    return np.sin(x)
and ends
# the bounds must be defined before mcIntegrate is called, since it reads them as globals
a = 0
b = np.pi
areas = mcIntegrate(3_000)
plt.title("Distribution of Approximated Integrals")
plt.hist(areas, bins=30, ec="black")
plt.xlabel("Areas")
plt.show()
Optimisation
I deliberately split the problem up at the smallest possible level. Was this a good idea? To answer that, consider: how might we optimise the linear process of generating the tasks? This does take a considerable while at the moment. We could parallelise it:
def _prepare(steps):
    xrand = random.uniform(a, b, steps)
    return [xrand[i] for i in range(steps)]
def mcIntegrate(steps):
    ...
    tasks = []
    for res in tqdm(pool.imap(_prepare, (steps for _ in range(steps))), total=steps):
        tasks += res # slower except for very large steps
Here I've used pool.imap, which returns an iterator that yields results as soon as they are available, allowing us to build a progress bar. If you do this and compare, you will see that it runs slower than the linear solution. Removing the progress bar and replacing it (on my machine) with:
import time
start = time.perf_counter()
results = pool.map(_prepare, (steps for _ in range(steps)))
tasks = []
for res in results:
    tasks += res
print(time.perf_counter() - start)
is only marginally faster: it's still slower than the linear version. Serialising data to send it to a process and then deserialising it has an overhead. If you try to get a progress bar on the whole thing, it becomes excruciatingly slow:
results = []
for result in tqdm(pool.imap(f, tasks), total=len(tasks)):
    results.append(result)
So what about iterating at a higher level? Here's another adaptation of your mcIntegrate:
a = 0
b = np.pi
def _mcIntegrate(steps):
    xrand = random.uniform(a, b, steps)
    integral = 0.0
    for i in range(steps):
        integral += f(xrand[i])
    ans = integral * ((b - a) / float(steps))
    return ans
def mcIntegrate(steps):
    areas = []
    p = Pool(cpu_count())
    for ans in tqdm(p.imap(_mcIntegrate, ((steps) for _ in range(steps))), total=steps):
        areas.append(ans)
    return areas
This, on my machine, is considerably faster. It's also considerably simpler. I was expecting a difference, but not such a large one.
Takeaways
Multiprocessing isn't free. Something as simple as np.sin() is too cheap to multiprocess: we pay to serialise, deserialise, append, and so on, all for one sin() calculation. But if you put too many calculations into a single task, you will waste time as you lose granularity. Here the effect is more striking than I was expecting. The only way to know the right level of granularity for a particular problem... is to profile and try.
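To make the granularity point concrete, here is a minimal sketch of grouping many cheap calculations into fewer, larger tasks before handing them to the pool. The helper names make_batches and sum_sin, the batch size, and the worker count are hypothetical, chosen only for illustration; the right batch size still has to be found by measuring.

```python
import math
from multiprocessing import Pool


def make_batches(values, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


def sum_sin(batch):
    """One task: enough work to amortise the cost of pickling the batch."""
    return sum(math.sin(x) for x in batch)


if __name__ == "__main__":
    values = [i * 1e-4 for i in range(100_000)]
    with Pool(4) as pool:
        # 10 tasks of 10,000 values each instead of 100,000 one-value tasks
        partial_sums = pool.map(sum_sin, make_batches(values, 10_000))
    print(sum(partial_sums))
```

Pool.map and Pool.imap also accept a chunksize argument that groups items in a similar way; batching by hand as above just makes the trade-off explicit.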

My experience is that multiprocessing is often not very efficient (a ton of overhead). The more you push your code into numpy the faster it will be, with one caveat; you can overload your memory if you're not careful (10k x 10k is getting large). Lastly, it looks like N is doing double duty, both defining sample size for each estimate, and also serving as the number of trial estimates.
Here is how I would do this (with minor style changes):
import numpy as np
f = np.sin
a = 0
b = np.pi
# number samples for each trial, trial count, and number calculated at once
N = 10000
TRIALS = 10000
BATCH_SIZE=1000
def mc_integrate(f, a, b, N, batch_size=BATCH_SIZE):
    # compute everything carrying `batch_size` copies by extending the array dimension.
    # samples.shape == (N, batch_size)
    samples = np.random.uniform(a, b, size=(N, batch_size))
    integrals = np.sum(f(samples), axis=0)
    mc_estimates = integrals * ((b - a) / N)
    return mc_estimates
# loop over batch values to get final result
n, r = divmod(TRIALS, BATCH_SIZE)
results = []
for j in [BATCH_SIZE]*n + [r]:
    results.extend(mc_integrate(f, a, b, N, batch_size=j))
On my machine this takes a few seconds.

Threading or multiprocess to run code in parallel seems slower

I've been searching, with little success, for a way to solve this problem. The script below is supposed to perform planet simulations. planet1_pars defines the first planet's parameters. set_grids_fakePlanet creates a grid for each parameter of a hypothetical planet added to the system; it returns a generator, not a list/array, with a large number of parameter values. planet2_pars returns one set of parameters previously created in set_grids_fakePlanet, so each call gives me a different set of parameters for the hypothetical planet. computeTTVs does some calculations and returns what I need each time I execute run_rebound, which is my main function and calls all the functions mentioned above. Whenever I execute run_rebound, I need to give it the hypothetical planet's parameters so that it runs the simulation.
def planet1_pars():
    P_p1,m_p1,e_p1 = 0.7920639164 / 365.25, 29.32*3.0027e-6, 0.0 #P[yrs], m[solar],e[fixed]
    inc_p1,omega_p1,M_p1 = 77.4041697839 * np.pi/180, 90., 0.
    return P_p1,m_p1,e_p1,inc_p1,omega_p1,M_p1
def set_grids_fakePlanet(pars_p1):
    P_p1,m_p1,e_p1,inc_p1,omega_p1,M_p1 = [*pars_p1]
    #min max periods in which to put a planet
    Pmin = P_p1 * 2.02 # Pmin ~ 0.0196/365.25 #shortest period found so far in exoplanet.eu
    Pmax = P_p1 * 2.05
    #set grids
    P_grid = np.arange(Pmin, Pmax, P_p1 * 0.005)
    m_p2_grid = np.arange(.5, 320, 1) * 3.0027e-6 # every 1 Earth mass to 1 Jupiter
    e_grid = [0.0]#np.linspace(0,0.1, 10) # e=1 may cause code to blow up
    inc_grid = [inc_p1]#np.linspace(60,90, 5)
    omega_grid = [0.0]#np.linspace(0,360, 5)
    M_grid = [0.0]#np.linspace(0,360, 5)
    #store grid vals, each column is a parameter, last column TTV amplitude
    #[n,m] n is max_size(P,e,inc,omega,M) ** m. m is # of orbital parameters + 1 ttv amp
    size = len(P_grid) * len(m_p2_grid) * len(e_grid) * len(inc_grid) * len(omega_grid) * len(M_grid)
    results = np.zeros([size,6+1]) * np.nan
    peiom_grid = ((x,k,y,w,j,z) for x in P_grid for k in m_p2_grid for y in e_grid for w in inc_grid
                  for j in omega_grid for z in M_grid)
    return peiom_grid
def planet2_pars():
    for pars_p2 in peiom_grid:
        return pars_p2
#2nd planet
# m_p2, P_p2, e_p2, inc_p2, omega_p2, M_p2 = system_parameters(n*m_p1, P_p2,e_p2,inc_p2,omega_p2,M_p2)
def computeTTVs(sim, P_p1, P_p2):
    N=34
    transittimes = np.zeros(int(N))
    p = sim.particles
    i = 0
    while i<N:
        y_old = p[1].y - p[0].y # (Thanks to David Martin for pointing out a bug in this line!)
        t_old = sim.t
        if P_p1 > P_p2:
            sim.integrate(sim.t+ (P_p2 * 0.05)) # check for transits every 0.5 time units. Note that 0.5 is shorter than one orbit
        else:
            sim.integrate(sim.t+ (P_p1 * 0.05)) #5% of period ~ 1h which is shorter than Tdur=2h
        t_new = sim.t
        if y_old*(p[1].y-p[0].y)<0. and p[1].x-p[0].x>0.: # sign changed (y_old*y<0), planet in front of star (x>0)
            while t_new-t_old>1e-7: # bisect until prec of 1e-5 reached
                if y_old*(p[1].y-p[0].y)<0.:
                    t_new = sim.t
                else:
                    t_old = sim.t
                sim.integrate( (t_new+t_old)/2.)
            transittimes[i] = sim.t
            i += 1
            sim.integrate(sim.t+ P_p1 * 0.01) # integrate 0.05 to be past the transit
    A = np.vstack([np.ones(N), range(N)]).T
    c, m = np.linalg.lstsq(A, transittimes, rcond=-1)[0] # fits a linear model and get period m and t0 c
    comp_t0s = c + m*np.array(range(N))
    OC = transittimes-comp_t0s # in years
    OC *= 365.25*24*60
    amp = rms(OC)
    # amp = np.diff([np.min(OC), np.max(OC)])[0]
    return amp #in minutes
def run_rebound(pars_p2):
    ms=1.02 #solar unit
    P_p1,m_p1,e_p1,inc_p1,omega_p1,M_p1 = planet1_pars()
    P_p2,m_p2,e_p2,inc_p2,omega_p2,M_p2 = [*pars_p2] #fake planet
    #start simulation
    sim = rebound.Simulation()
    sim.G = 39.478 #AU^3 yr^-2 Ms^-1
    sim.add(m=ms)
    sim.add(m=m_p1, P=P_p1, e=e_p1, inc=inc_p1, omega=omega_p1, M=M_p1)
    sim.add(m=m_p2, P=P_p2, e=e_p2, inc=inc_p2, omega=omega_p2, M=M_p2)
    #put outcomes in a list
    results = [P_p2,m_p2,e_p2,inc_p2*(180/np.pi),omega_p2,M_p2, computeTTVs(sim, P_p1, P_p2)]
    return results
Question: I tried to make it parallel using the threading library as in:
peiom_grid = set_grids_fakePlanet(planet1_pars()) #make the fake planet grid as a generator variable
import threading
start = time.time()
for pars in peiom_grid:
    t1 = threading.Thread(target=run_rebound, args=(pars,))
    t1.start()
    t1.join()
end = time.time()
print((end-start) /60, 'min')
In this manner, I see that the 8 CPUs I have are being used, but at a rate of less than 50%.
It takes ~1.2 min to run (the grids are small because I am testing, but ideally the grids should be larger, so it may take days to run).
I also tried multiprocessing:
from multiprocessing import Process
start = time.time()
if __name__ == '__main__':
    for pars in peiom_grid:
        p = Process(target=run_rebound, args=(pars,))
        p.start()
        p.join()
end = time.time()
print((end-start) /60, 'min')
it takes ~ 1.7min
and without any parallelization
start = time.time()
for pars in peiom_grid:
    run_rebound(pars)
end = time.time()
print((end-start)/60, 'min')
it takes ~ 1.34 min
I think I am not actually parallelizing anything, because the difference between the runs above with and without parallelization isn't significant. I cannot find where the issue is. I followed a few examples and checked several questions on Stack Overflow, but nothing helped. I hope you can give me some feedback.

In the case of multithreading, Mark is right: the bottleneck is Python's GIL. Multiprocessing is free of this limitation (but is subject to a different overhead, minimal in this case).
The reason you don't see any improvement is that .join() waits for the process to finish. So this implementation starts a single process and then immediately blocks until it is complete, and only then starts the next one. To fix this, move .join() out of the process creation loop:
processes = []
for pars in peiom_grid:
    p = Process(target=run_rebound, args=(pars,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
A more straightforward way to do this would be to use process pool:
from multiprocessing import Pool
with Pool() as pool: # will use the number of CPUs in the system by default
    results = pool.map(run_rebound, peiom_grid)
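The effect of where .join() is placed can be seen without the simulation code. The following is a minimal sketch, assuming time.sleep as a stand-in for run_rebound and using threads instead of processes (sleeping releases the GIL, so the threads really do overlap); run_serial and run_concurrent are hypothetical names. The same ordering issue applies to Process objects.

```python
import threading
import time


def run_serial(n_tasks, delay):
    """start() then join() inside the loop: each task waits for the previous one."""
    start = time.perf_counter()
    for _ in range(n_tasks):
        t = threading.Thread(target=time.sleep, args=(delay,))
        t.start()
        t.join()  # blocks here, so nothing runs concurrently
    return time.perf_counter() - start


def run_concurrent(n_tasks, delay):
    """start() everything first, join() afterwards: the tasks overlap."""
    start = time.perf_counter()
    threads = [threading.Thread(target=time.sleep, args=(delay,))
               for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("join inside loop :", round(run_serial(8, 0.25), 2), "s")
    print("join after loop  :", round(run_concurrent(8, 0.25), 2), "s")
```

With join inside the loop the total time grows with the number of tasks; with join after the loop it stays close to the duration of a single task.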

cuda python GPU numbapro 3d loop poor performance

I am trying to set up a 3D loop with the assignment
C(i,j,k) = A(i,j,k) + B(i,j,k)
using Python on my GPU. This is my GPU:
http://www.geforce.com/hardware/desktop-gpus/geforce-gt-520/specifications
The sources I'm looking at / comparing with are:
http://nbviewer.ipython.org/gist/harrism/f5707335f40af9463c43
http://nbviewer.ipython.org/github/ContinuumIO/numbapro-examples/blob/master/webinars/2014_06_17/intro_to_gpu_python.ipynb
It's possible that I've imported more modules than necessary. This is my code:
import numpy as np
import numbapro
import numba
import math
from timeit import default_timer as timer
from numbapro import cuda
from numba import *
@autojit
def myAdd(a, b):
    return a+b
myAdd_gpu = cuda.jit(restype=f8, argtypes=[f8, f8], device=True)(myAdd)
@cuda.jit(argtypes=[float32[:,:,:], float32[:,:,:], float32[:,:,:]])
def myAdd_kernel(a, b, c):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    tz = cuda.threadIdx.z
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bz = cuda.blockIdx.z
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    bd = cuda.blockDim.z
    i = tx + bx * bw
    j = ty + by * bh
    k = tz + bz * bd
    if i >= c.shape[0]:
        return
    if j >= c.shape[1]:
        return
    if k >= c.shape[2]:
        return
    for i in xrange(0,c.shape[0]):
        for j in xrange(0,c.shape[1]):
            for k in xrange(0,c.shape[2]):
                # c[i,j,k] = a[i,j,k] + b[i,j,k]
                c[i,j,k] = myAdd_gpu(a[i,j,k],b[i,j,k])
def main():
    my_gpu = numba.cuda.get_current_device()
    print "Running on GPU:", my_gpu.name
    cores_per_capability = {1: 8,2: 32,3: 192,}
    cc = my_gpu.compute_capability
    print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
    majorcc = cc[0]
    print "Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT
    cores_per_multiprocessor = cores_per_capability[majorcc]
    print "Number of cores per mutliprocessor:", cores_per_multiprocessor
    total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
    print "Number of cores on GPU:", total_cores
    N = 100
    thread_ct = my_gpu.WARP_SIZE
    block_ct = int(math.ceil(float(N) / thread_ct))
    print "Threads per block:", thread_ct
    print "Block per grid:", block_ct
    a = np.ones((N,N,N), dtype = np.float32)
    b = np.ones((N,N,N), dtype = np.float32)
    c = np.zeros((N,N,N), dtype = np.float32)
    start = timer()
    cg = cuda.to_device(c)
    myAdd_kernel[block_ct, thread_ct](a,b,cg)
    cg.to_host()
    dt = timer() - start
    print "Wall clock time with GPU in %f s" % dt
    print 'c[:3,:,:] = ' + str(c[:3,1,1])
    print 'c[-3:,:,:] = ' + str(c[-3:,1,1])
if __name__ == '__main__':
    main()
My result from running this is the following:
Running on GPU: GeForce GT 520
Compute capability: 2.1 (Numba requires >= 2.0)
Number of streaming multiprocessor: 1
Number of cores per mutliprocessor: 32
Number of cores on GPU: 32
Threads per block: 32
Block per grid: 4
Wall clock time with GPU in 1.104860 s
c[:3,:,:] = [ 2. 2. 2.]
c[-3:,:,:] = [ 2. 2. 2.]
When I run the examples in the sources, I see significant speedup. I don't think my example is running properly since the wall clock time is much longer than I would expect. I've modeled this mostly from the "even bigger speedups with cuda python" section in the first example link.
I believe I've indexed correctly and safely. Maybe the problem is with my blockdim? or griddim? Or maybe I'm using the wrong types for my GPU. I think I read that they must be a certain type. I'm very new to this so the problem very well could be trivial!
Any and all help is greatly appreciated!

You are creating your indexes correctly but then you're ignoring them.
Running the nested loop
for i in xrange(0,c.shape[0]):
    for j in xrange(0,c.shape[1]):
        for k in xrange(0,c.shape[2]):
is forcing all your threads to loop through all values in all dimensions, which is not what you want. You want each thread to compute one value in a block and then move on.
I think something like this should work better...
i = tx + bx * bw
while i < c.shape[0]:
    j = ty+by*bh
    while j < c.shape[1]:
        k = tz + bz * bd
        while k < c.shape[2]:
            c[i,j,k] = myAdd_gpu(a[i,j,k],b[i,j,k])
            k+=cuda.blockDim.z*cuda.gridDim.z
        j+=cuda.blockDim.y*cuda.gridDim.y
    i+=cuda.blockDim.x*cuda.gridDim.x
Try to compile and run it. Also make sure to validate it, as I have not.
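Since the kernel above is untested, one way to gain some confidence in the indexing is to emulate it on the CPU. The sketch below is plain Python, not CUDA code: it mimics the start-plus-stride pattern in one dimension and counts how often each element would be visited. The function names grid_stride_indices and coverage are hypothetical, and the launch sizes are just the ones from the question.

```python
def grid_stride_indices(thread_idx, block_idx, block_dim, grid_dim, n):
    """Indices one emulated thread would handle in a 1D grid-stride loop."""
    start = thread_idx + block_idx * block_dim   # same as tx + bx * bw
    stride = block_dim * grid_dim                # total number of threads in the grid
    return list(range(start, n, stride))


def coverage(block_dim, grid_dim, n):
    """Count how many times each of the n elements is visited across all threads."""
    visits = [0] * n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            for i in grid_stride_indices(thread_idx, block_idx, block_dim, grid_dim, n):
                visits[i] += 1
    return visits


if __name__ == "__main__":
    # 4 blocks of 32 threads over 100 elements, as in the question's launch
    visits = coverage(block_dim=32, grid_dim=4, n=100)
    print("every element visited exactly once:", all(v == 1 for v in visits))
```

If every count is 1, each element is written by exactly one thread, which is the property the original nested xrange loops violated by having every thread write every element.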

I don't see you using imshow or show, so there is no need to import those.
It doesn't appear that you use your import of math (I didn't see any calls of math.some_function).
Your imports from numba and numbapro seem repetitive. Your "from numba import cuda" overrides your "from numbapro import cuda", since it comes after it, so your calls to cuda use the cuda from numba, not numbapro. When you call "from numba import *", you import everything from numba, not just cuda, which seems to be the only thing you use. Also, (I believe) import numba.cuda is equivalent to from numba import cuda. Why not replace all your imports from numba and numbapro with a single "from numba import cuda"?
