Any way to speed up runs? - python

I'm currently trying to do a Monte Carlo simulation. The problem is that it's taking quite a while to run 100,000 runs or more, when I'm told it shouldn't take very long.
Here's my code:
runs = 10000
import matplotlib.pyplot as plt
import random
import numpy as np
from scipy.stats import norm
from scipy.stats import uniform
import seaborn as sns
import pandas
def steadystate():
    p=0.88
    Cout=4700000000
    LambdaAER=0.72
    Vol=44.5
    Depo=0.42
    Uptime=0.1
    Effic=0.38
    Recirc=4.3
    x = random.randint(86900000,2230000000000)
    conc = ((p*Cout*LambdaAER)+(x/Vol))/(LambdaAER+Depo+(Uptime*Effic*Recirc))
    return conc
x = 0
while x < runs:
    #results = steadystate (Faster)
    results = np.array([steadystate() for _ in range(1000)])
    print(results)
    x+=1
ax = sns.distplot(results,
                  bins=100,
                  kde=True,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Uniform Distribution ', ylabel='Frequency')
I'm fairly new to Python, so I'm unsure where to optimize my code. Any help or suggestions would be much appreciated.

You're not actually benefiting from numpy here, because you produce each value one at a time, doing all the math for that one value, then producing the array from the results. Work with arrays from the get-go, and do all the work on all elements in bulk to derive the benefits of vectorization:
import numpy.random
def steadystate(count): # Receive desired number of values for bulk generation
    p=0.88
    Cout=4700000000
    LambdaAER=0.72
    Vol=44.5
    Depo=0.42
    Uptime=0.1
    Effic=0.38
    Recirc=4.3
    x = numpy.random.randint(86900000, 2230000000000, count) # Make array of count values all at once
    # Perform all the math in bulk
    conc = ((p*Cout*LambdaAER)+(x/Vol))/(LambdaAER+Depo+(Uptime*Effic*Recirc))
    return conc
x = 0
while x < runs:
    results = steadystate(1000) # Just call with number of desired items
    print(results)
    x+=1
Note that this code matches your original code by replacing results each time, rather than accumulating results. I'm not clear on what you want to do instead, so this is just doing the (probably) wrong thing much faster.
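If the goal is simply one large sample rather than many overwritten batches, a minimal sketch (assuming that is what you want, and using NumPy's newer Generator API, which the code above does not use) draws every value in a single call:

```python
import numpy as np

def steadystate(count, rng):
    """Return `count` steady-state concentrations, computed in bulk."""
    p = 0.88
    Cout = 4700000000
    LambdaAER = 0.72
    Vol = 44.5
    Depo = 0.42
    Uptime = 0.1
    Effic = 0.38
    Recirc = 4.3
    # One vectorized draw. int64 is requested explicitly because the upper
    # bound does not fit in a 32-bit integer; endpoint=True keeps the upper
    # bound inclusive, like random.randint in the question.
    x = rng.integers(86900000, 2230000000000, size=count,
                     dtype=np.int64, endpoint=True)
    return ((p * Cout * LambdaAER) + (x / Vol)) / (LambdaAER + Depo + (Uptime * Effic * Recirc))

rng = np.random.default_rng(0)       # seed chosen only to make the sketch reproducible
results = steadystate(100_000, rng)  # all 100,000 runs at once, no Python loop
```

The single results array can then be passed to the plotting call once, instead of printing and discarding a batch on every iteration.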

About 70% of the run time is spent generating the random numbers.
The question is whether you need fresh random numbers each time. Would it be sufficient to generate the random matrix just once and reuse it?
That said, the code is already fairly quick. Excluding the plotting, one iteration took just 1.2 ms:
%timeit results = np.array([steadystate() for _ in range(1000)])
1.24 ms ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Related

Integrating Out Dimension from MultiDimensional Array using Parallel Processing

I was hoping to find some clever approaches to solving a parallel-processing problem I've been struggling with. Basically, I am dealing with 20,160 multidimensional arrays with size (72,35,25,20). Currently, I'm integrating out the dimension with size 72 by simply doing a trapezoidal integration in a nested for-loop. My end goal is to get an output array with size (20160,35,25,20).
for idx,filename in enumerate(filenames):
    #Read NetCDF Data File as 'raw_data'
    flux=raw_data['FluxHydrogen'][:] #This is size (72,35,25,20)
    PA=raw_data['PitchAngleGrid'][:] #This is size (72)
    for i in range(35):
        for j in range(25):
            for k in range(20):
                dir_flux=flux[:,i,j,k]
                omni_flux=np.trapz(dir_flux*np.sin(PA),PA)
                data[idx,i,j,k]=omni_flux #This will have size (20160,35,25,20)
I believe it would be most beneficial to implement the parallelization lower in the nested for-loop but can't seem to figure out how. I have searched for common questions, but none [that I have found] provide enough insight into how to implement shared memory, pass multidimensional arrays to the pools, and/or reshape the resulting array. Any help or insight would be greatly appreciated.
You can use Numba to speed up this code by a large margin. Numba is a JIT compiler that can compile NumPy-based code to fast native code (so loops are not an issue with it; in fact, using loops is a good idea in Numba).
The first thing to do is to pre-compute np.sin(PA) once to avoid repeated computations. Then, dir_flux * np.sin(PA) can be computed using a for loop and the result stored in a pre-allocated array, so as not to perform millions of expensive small array allocations. The outer loop can be executed using multiple threads with prange and the Numba flag parallel=True. It can be further accelerated using the flag fastmath=True, assuming the input values are not special (like NaN, Inf, or very small values: see subnormal numbers).
While this should theoretically be enough to get fast code, the current implementation of np.trapz is not efficient, as it performs expensive allocations. One can easily rewrite the function so that it does not allocate any additional arrays.
Here is the resulting code:
import numpy as np
import numba as nb
@nb.njit('(float64[::1], float64[::1])')
def trapz(y, x):
    s = 0.0
    for i in range(x.size-1):
        dx = x[i+1] - x[i]
        dy = y[i] + y[i+1]
        s += dx * dy
    return s * 0.5
@nb.njit('(float64[:,:,:,:], float64[:])', parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    flattenPA = np.ascontiguousarray(PA)
    sinPA = np.sin(flattenPA)
    for i in nb.prange(si):
        tmp = np.empty(sl)
        for j in range(sj):
            for k in range(sk):
                dir_flux = flux[:, i, j, k]
                for l in range(sl):
                    tmp[l] = dir_flux[l] * sinPA[l]
                omni_flux = trapz(tmp, flattenPA)
                data[i, j, k] = omni_flux
    return data
for idx,filename in enumerate(filenames):
    # Read NetCDF Data File as 'raw_data'
    flux=raw_data['FluxHydrogen'][:] #This is size (72,35,25,20)
    PA=raw_data['PitchAngleGrid'][:] #This is size (72)
    data[idx] = compute(flux, PA)
Note that flux and PA must be NumPy arrays. Also note that trapz is accurate as long as len(PA) is relatively small and np.std(PA) is not huge. Otherwise, a pairwise summation or even a (paranoid) Kahan summation should help (note that NumPy uses pairwise summation). In practice, the results are the same on random normal numbers.
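For reference, a compensated (Kahan) version of the trapezoid sum could look like the sketch below. It is plain Python for readability; trapz_kahan is a hypothetical name, not part of NumPy or Numba, and it has not been benchmarked here.

```python
import numpy as np

def trapz_kahan(y, x):
    """Trapezoidal rule with Kahan-compensated accumulation."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for i in range(len(x) - 1):
        term = (x[i + 1] - x[i]) * (y[i] + y[i + 1])
        corrected = term - comp
        new_total = total + corrected
        # (new_total - total) is what was actually added; the difference
        # from `corrected` is the rounding error carried to the next step.
        comp = (new_total - total) - corrected
        total = new_total
    return 0.5 * total

# Illustrative input: sin on a 72-point grid over [0, pi]
x = np.linspace(0.0, np.pi, 72)
y = np.sin(x)
area = trapz_kahan(y, x)
```

For 72 points the compensation makes no practical difference; it only starts to matter for much longer or badly scaled grids.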
Further optimizations
The code can be made even faster by making flux accesses more contiguous. An efficient transposition can be used to do that (the one of Numpy is not efficient). However, this is not simple to do on 4D arrays. Another solution is to compute the trapz operation on whole lines of the k dimension. This makes the computation very efficient and nearly memory-bound on my machine. Here is the code:
@nb.njit('(float64[:,:,:,:], float64[:])', fastmath=True, parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    sinPA = np.sin(PA)
    premultPA = PA * 0.5
    for i in nb.prange(si):
        for j in range(sj):
            dir_flux = flux[:, i, j, :]
            data[i, j, :].fill(0.0)
            for l in range(sl-1):
                dx = premultPA[l+1] - premultPA[l]
                fact1 = dx * sinPA[l]
                fact2 = dx * sinPA[l+1]
                for k in range(sk):
                    data[i, j, k] += fact1 * dir_flux[l, k] + fact2 * dir_flux[l+1, k]
    return data
Note that the premultiplication makes the computation slightly less precise.
Results
Here are the results on random numbers (like @DominikStańczak used) on my 6-core machine (i5-9600KF processor):
Initial sequential solution: 193.14 ms (± 1.8 ms)
DominikStańczak sequential vectorized solution: 8.68 ms (± 48.1 µs)
Numba parallel solution without fastmath: 0.48 ms (± 6.7 µs)
Numba parallel solution without fastmath: 0.38 ms (± 9.5 µs)
Best Numba solution (with fastmath): 0.32 ms (± 5.2 µs)
Optimal lower-bound execution: 0.24 ms (RAM bandwidth saturation)
Thus, the fastest Numba version is 27 times faster than the (sequential) version of @DominikStańczak and 604 times faster than the initial one. It is nearly optimal.
As a first step, let's vectorize the code itself. I'm just going to deal with doing this on a per-file basis for now, to show you how to get rid of the nested for loop:
shape = (72, 35, 25, 20)
flux = np.random.normal(size=shape)
PA = np.random.normal(size=shape[0])
Now, timing your implementation, rewritten a little:
%%timeit
data = np.empty(shape[1:])
for i in range(shape[1]):
    for j in range(shape[2]):
        for k in range(shape[3]):
            dir_flux=flux[:,i,j,k]
            omni_flux=np.trapz(dir_flux*np.sin(PA),PA)
            data[i,j,k]=omni_flux
# 211 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My first idea was to pull sin out of the for loop, because there's no need to recalculate it each time, but that got me 10ms tops. However, if instead of the for loop, we used plain numpy vectorization via broadcasting, turning sin_PA into a (72, 1, 1, 1)-shaped array:
%%timeit
sin_PA = np.sin(PA).reshape(-1, 1, 1, 1)
data = np.trapz(flux * sin_PA, x=PA, axis=0)
# 9.03 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's a 20 times speed up, nothing to scoff at. I estimate it'd take about three minutes for all your files. You can also use np.allclose to verify the results agree up to floating point error.
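The np.allclose check could be sketched as follows. The shape is shrunk so the loop finishes quickly, the pitch-angle grid is an assumed sorted grid on [0, pi], and the getattr line only papers over the np.trapz / np.trapezoid naming difference between NumPy versions:

```python
import numpy as np

# np.trapezoid is the newer name for np.trapz; use whichever is available.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

rng = np.random.default_rng(0)
shape = (72, 5, 4, 3)  # smaller than (72, 35, 25, 20), same layout
flux = rng.normal(size=shape)
PA = np.sort(rng.uniform(0.0, np.pi, size=shape[0]))  # assumed pitch-angle grid

# Reference: the nested-loop version
looped = np.empty(shape[1:])
for i in range(shape[1]):
    for j in range(shape[2]):
        for k in range(shape[3]):
            looped[i, j, k] = trapezoid(flux[:, i, j, k] * np.sin(PA), PA)

# Broadcast version: sin(PA) reshaped to (72, 1, 1, 1), integrated along axis 0
sin_PA = np.sin(PA).reshape(-1, 1, 1, 1)
vectorized = trapezoid(flux * sin_PA, x=PA, axis=0)

agree = np.allclose(looped, vectorized)
```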
If you still need to parallelize this afterwards, I would use dask.array. In fact, if you've got your data in netcdf4 files, I would use xarray (which is helpful for multidimensional data anyway) to read those, and then run the trapz computations on that with Dask enabled in the backend. I think that's the simplest way to achieve easy multiprocessing in this case. Here's a quick sketch:
import numpy as np
import xarray
from dask.distributed import Client
client = Client()
file_data = xarray.open_mfdataset(filenames, parallel=True)
# massage the data a little, probably
flux = file_data["FluxHydrogen"]
PA = file_data["PitchAngleGrid"]
integrand = flux * np.sin(PA) # most element-wise numpy operations work on xarray ones or Dask based ones without a hitch
data = integrand.integrate(coord="PitchAngle") # or some such name for the dimension you're integrating out

np fast random sampling with boolean "filter"

I'm wondering if there is a way to speed up the following piece of code using numpy.
The only value I'm interested in is the points_within_distance, the numpy array can be discarded or modified if needed.
for _ in range(SAMPLE_SIZE):
    random_point = (random.uniform(0, cube.length), random.uniform(0, cube.length))
    if distance_to_center(random_point, cube) <= sphere.r:
        points_within_distance += 1
This currently clocks around 0.750ms for a sample size of 1000.
I have tried
samples = np.random.random_sample((SAMPLE_SIZE, 2)) * cube.length # random sample of SAMPLE_SIZE
for row in samples:
    if distance_to_center(row, cube) < sphere.r:
        points_within_distance += 1
However this clearly is even more inefficient and clocks in at around 1.2 ms.
I'm not quite sure how to go about using masks in this scenario or if masks is even the right thing here to utilize.
I think avoiding the for loop can speed this up much more: generate the samples all at once and compare them all at once. Something like:
x = np.random.uniform(0, cube.length, size=SAMPLE_SIZE)
y = np.random.uniform(0, cube.length, size=SAMPLE_SIZE)
distance = np.square(x - x_center) + np.square(y - y_center)
points_within_distance = (distance <= r**2).sum()
I tested it using IPython's %%timeit for a sample size of 1000, and it reports: 129 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
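Put together as a self-contained sketch, with hypothetical stand-ins for the question's cube and sphere objects (a square of side length and a circle of radius r centred in it):

```python
import numpy as np

SAMPLE_SIZE = 200_000
length = 2.0                     # stand-in for cube.length
r = 1.0                          # stand-in for sphere.r
x_center = y_center = length / 2

rng = np.random.default_rng(0)
points = rng.uniform(0, length, size=(SAMPLE_SIZE, 2))

# Compare squared distances so no square root is needed
sq_dist = np.square(points[:, 0] - x_center) + np.square(points[:, 1] - y_center)
inside = sq_dist <= r**2         # boolean mask, one entry per sample
points_within_distance = int(np.count_nonzero(inside))

# With these stand-in values the hit fraction should be close to pi / 4
fraction = points_within_distance / SAMPLE_SIZE
```

The boolean array inside is the "mask" the question asks about; counting its True entries replaces the if statement inside the loop.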

Vectorize a for loop over pandas ewm function

So I want to vectorize a for loop to speed things up. My code is the following:
import numpy as np
import pandas as pd
def my_func(array, n):
    return pd.Series(array).ewm(span = n, min_periods = n-1).mean().to_numpy()
np.random.seed(0)
data_size = 120000
data = np.random.uniform(0,1000, size = data_size)+29000
loop_size = 1000
step_size = 1
X = np.zeros([data.shape[0], loop_size])
parameter_array = np.arange(1,loop_size+ step_size, step_size)
for i in parameter_array:
    X[:, i-1] = my_func(data, i)
The entire for loop takes about a minute to finish, which could be a problem for future applications. I have already checked numpy.vectorize(), but its documentation states clearly that it is for convenience only, so using it won't speed up the code by an order of magnitude.
My question is: is there a way to vectorize a for loop like this? If so, could I see a simple example of how it can be done?
Thank you in advance

Concurrently reading numpy arrays in parallel

Consider the following:
fine = np.random.uniform(0,100,10)
fine[fine<20] = 0 # introduce some intermittency
coarse = np.sum(fine.reshape(-1,2),axis=1)
fine is a timeseries of magnitudes (e.g. volume of rainfall). coarse is the same timeseries but at a halved resolution, so every 2 timesteps in fine are aggregated to a single value in coarse.
I am then interested in the weighting that determines the proportions of the magnitude of coarse that corresponds to each timestep in fine for the instances where the value of coarse is above zero.
def w_xx(fine, coarse):
    weights = []
    for i, val in enumerate(coarse):
        if val > 0:
            w = fine[i*2:i*2+2]/val # returns both w1 and w2, w1 is 1st element, w2 = 1-w1 is second
            weights.append(w)
    return np.asarray(weights)
So w_xx(fine,coarse) would return an array of shape 5,2 where the elements of axis=1 are the weights of fine for a value of coarse.
This is all fine for smaller timeseries, but I'm running this analysis on ~60k-sized arrays of fine, plus in a loop of 300+ iterations.
I have been trying to make this run in parallel using the multiprocessing library in Python 2.7, but I've not managed to get far. I need to be reading both timeseries at the same time in order to get the corresponding values of fine for every value in coarse, and to only work on values above 0, which is what my analysis requires.
I would appreciate suggestions on a better way to do this. I imagine if I can define a mapping function to use with Pool.map in multiprocessing, I should be able to parallelize this? I've only just started out with multiprocessing so I don't know if there is another way?
Thank you.
You can achieve the same result in a vectorized form by simply doing:
>>> (fine / np.repeat(coarse, 2)).reshape(-1, 2)
Then you can filter out the rows where coarse is zero using np.isfinite, since if coarse is zero the output is either inf or nan.
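A variation on the same idea, sketched under the assumption that fine has even length: select the rows with a positive coarse value first and divide only those, which avoids the divide-by-zero warnings and the isfinite pass. The name w_masked is made up for this example.

```python
import numpy as np

def w_masked(fine, coarse):
    """Weights of each pair of fine values relative to their coarse sum."""
    pairs = fine.reshape(-1, 2)  # one row per coarse timestep
    keep = coarse > 0            # boolean mask of rows worth dividing
    return pairs[keep] / coarse[keep, np.newaxis]

rng = np.random.default_rng(0)
fine = rng.uniform(0, 100, 10)
fine[fine < 20] = 0              # introduce some intermittency
coarse = fine.reshape(-1, 2).sum(axis=1)
weights = w_masked(fine, coarse)
```

Each row of weights holds w1 and w2 for one retained coarse value, so every row sums to 1.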
In addition to the NumPy expression proposed by @behzad.nouri, you can use the Pythran compiler to reap extra speedups:
$ cat w_xx.py
#pythran export w_xx(float[], float[])
import numpy as np
def w_xx(fine, coarse):
    w = (fine / np.repeat(coarse, 2))
    return w[np.isfinite(w)].reshape(-1, 2)
$ python -m timeit -s 'import numpy as np; fine = np.random.uniform(0, 100, 100000); fine[fine<20] = 0; coarse = np.sum(fine.reshape(-1, 2), axis=1); from w_xx import w_xx' 'w_xx(fine, coarse)'
1000 loops, best of 3: 1.5 msec per loop
$ pythran w_xx.py -fopenmp -march=native # yes, this generates parallel code
$ python -m timeit -s 'import numpy as np; fine = np.random.uniform(0, 100, 100000); fine[fine<20] = 0; coarse = np.sum(fine.reshape(-1, 2), axis=1); from w_xx import w_xx' 'w_xx(fine, coarse)'
1000 loops, best of 3: 867 usec per loop
Disclaimer: I am a Pythran dev.
Excellent! I didn't know about np.repeat, thank you very much.
To answer my original question in the form it was presented, I've then also managed to make this work with multiprocessing:
import numpy as np
from multiprocessing import Pool
fine = np.random.uniform(0,100,100000)
fine[fine<20] = 0
coarse = np.sum(fine.reshape(-1,2),axis=1)
def wfunc(zipped):
    return zipped[0]/zipped[1]
def wpar(zipped, processes):
    p = Pool(processes)
    calc = np.asarray(p.map(wfunc, zip(fine,np.repeat(coarse,2))))
    p.close()
    p.join()
    return calc[np.isfinite(calc)].reshape(-1,2)
However, the suggestion by @behzad.nouri is evidently better:
def w_opt(fine, coarse):
    w = (fine / np.repeat(coarse, 2))
    return w[np.isfinite(w)].reshape(-1,2)
#using some iPython magic
%timeit w_opt(fine,coarse)
1000 loops, best of 3: 1.88 ms per loop
%timeit w_xx(fine,coarse)
1 loops, best of 3: 342 ms per loop
%timeit wpar(zip(fine,np.repeat(coarse,2)),6) #I've 6 cores at my disposal
1 loops, best of 3: 1.76 s per loop
Thanks again!

Numpy: averaging many datapoints at each time step

This question is probably answered somewhere, but I cannot find where, so I will ask here:
I have a set of data consisting of several samples per timestep. So, I basically have two arrays, "times", which looks something like: (0,0,0,1,1,1,1,1,2,2,3,4,4,4,4,...) and my data which is the value for each time. Each timestep has a random number of samples. I would like to get the average value of the data at each timestep in an efficient manner.
I have prepared the following sample code to show what my data looks like. Basically, I am wondering if there is a more efficient way to write the "average_values" function.
import numpy as np
import matplotlib.pyplot as plt
def average_values(x,y):
    unique_x = np.unique(x)
    averaged_y = [np.mean(y[x==ux]) for ux in unique_x]
    return unique_x, averaged_y
#generate our data
times = []
samples = []
#we have some timesteps:
for time in np.linspace(0,10,101):
    #and a random number of samples at each timestep:
    #(np.random.random_integers is deprecated; randint's upper bound is exclusive, so this still draws from 1..10)
    num_samples = np.random.randint(1,11)
    for i in range(0,num_samples):
        times.append(time)
        samples.append(np.sin(time)+np.random.random()*0.5)
times = np.array(times)
samples = np.array(samples)
plt.plot(times,samples,'bo',ms=3,mec=None,alpha=0.5)
plt.plot(*average_values(times,samples),color='r')
plt.show()
Here is what it looks like:
A generic code to do this would do something as follows:
def average_values_bis(x, y):
    unq_x, idx = np.unique(x, return_inverse=True)
    count_x = np.bincount(idx)
    sum_y = np.bincount(idx, weights=y)
    return unq_x, sum_y / count_x
Adding the function above and following line for the plotting to your script
plt.plot(*average_values_bis(times, samples),color='g')
produces this output, with the red line hidden behind the green one:
But timing both approaches reveals the benefits of using bincount, a 30x speed-up:
%timeit average_values(times, samples)
100 loops, best of 3: 2.83 ms per loop
%timeit average_values_bis(times, samples)
10000 loops, best of 3: 85.9 us per loop
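To see why the bincount version works, here is the same sequence of calls on a tiny hand-made input (the values are made up for illustration):

```python
import numpy as np

x = np.array([0, 0, 1, 1, 1, 2])                 # timestep of each sample
y = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 10.0])    # sample values

# idx maps every sample to the position of its timestep in unq_x
unq_x, idx = np.unique(x, return_inverse=True)   # unq_x = [0, 1, 2], idx = [0, 0, 1, 1, 1, 2]
count_x = np.bincount(idx)                       # samples per timestep: [2, 3, 1]
sum_y = np.bincount(idx, weights=y)              # sum of y per timestep: [4., 12., 10.]
means = sum_y / count_x                          # per-timestep average: [2., 4., 10.]
```

Both bincount calls make a single pass over the data, whereas the list comprehension in average_values rescans the whole array once per unique timestep.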
May I propose a pandas solution? It is highly recommended if you are going to be working with time series.
Create test data
import pandas as pd
import numpy as np
times = np.random.randint(0,10,size=50)
values = np.sin(times) + np.random.random_sample((len(times),))
s = pd.Series(values, index=times)
s.plot(linestyle='None', marker='o')  # '.' is not a valid matplotlib linestyle; 'None' draws markers only
Calculate averages
avs = s.groupby(level=0).mean()
avs.plot()
