Consider the following:
fine = np.random.uniform(0,100,10)
fine[fine<20] = 0 # introduce some intermittency
coarse = np.sum(fine.reshape(-1,2),axis=1)
fine is a timeseries of magnitudes (e.g. volume of rainfall). coarse is the same timeseries but at a halved resolution, so every 2 timesteps in fine are aggregated to a single value in coarse.
I am then interested in the weights that give, for every value of coarse above zero, the proportion of that coarse magnitude contributed by each of its two fine timesteps.
def w_xx(fine, coarse):
    weights = []
    for i, val in enumerate(coarse):
        if val > 0:
            w = fine[i*2:i*2+2]/val # returns both w1 and w2; w1 is the 1st element, w2 = 1-w1 is the second
            weights.append(w)
    return np.asarray(weights)
So w_xx(fine, coarse) would return an array of shape (5, 2) here (fewer rows if any coarse value is zero), where the elements along axis=1 are the weights of fine for a value of coarse.
This is all fine for smaller timeseries, but I'm running this analysis on fine arrays of ~60k elements, inside a loop of 300+ iterations.
I have been trying to make this run in parallel using the multiprocessing library in Python 2.7, but I've not managed to get far. I need to read both timeseries at the same time in order to get the corresponding values of fine for every value in coarse, and to only operate on values above 0, which is what my analysis requires.
I would appreciate suggestions on a better way to do this. I imagine that if I can define a mapping function to use with Pool.map in multiprocessing, I should be able to parallelize this? I've only just started out with multiprocessing, so I don't know if there is another way.
Thank you.
You can achieve the same result in a vectorized form by simply doing:
>>> (fine / np.repeat(coarse, 2)).reshape(-1, 2)
You can then filter out the rows where coarse is zero by using np.isfinite, since whenever coarse is zero the output is either inf or nan.
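As an equivalent sketch that avoids the divide-by-zero warning altogether, you could mask on coarse directly before dividing (illustrative, not part of the original answer):

# Select only the aggregation windows whose coarse value is positive,
# then divide each pair of fine values by its coarse total.
mask = coarse > 0
weights = fine.reshape(-1, 2)[mask] / coarse[mask][:, None]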
In addition to the NumPy expression proposed by @behzad.nouri, you can use the Pythran compiler to reap extra speedups:
$ cat w_xx.py
#pythran export w_xx(float[], float[])
import numpy as np

def w_xx(fine, coarse):
    w = (fine / np.repeat(coarse, 2))
    return w[np.isfinite(w)].reshape(-1, 2)
$ python -m timeit -s 'import numpy as np; fine = np.random.uniform(0, 100, 100000); fine[fine<20] = 0; coarse = np.sum(fine.reshape(-1, 2), axis=1); from w_xx import w_xx' 'w_xx(fine, coarse)'
1000 loops, best of 3: 1.5 msec per loop
$ pythran w_xx.py -fopenmp -march=native # yes, this generates parallel code
$ python -m timeit -s 'import numpy as np; fine = np.random.uniform(0, 100, 100000); fine[fine<20] = 0; coarse = np.sum(fine.reshape(-1, 2), axis=1); from w_xx import w_xx' 'w_xx(fine, coarse)'
1000 loops, best of 3: 867 usec per loop
Disclaimer: I am a Pythran dev.
Excellent! I didn't know about np.repeat, thank you very much.
To answer my original question in the form it was presented, I've then also managed to make this work with multiprocessing:
import numpy as np
from multiprocessing import Pool

fine = np.random.uniform(0, 100, 100000)
fine[fine < 20] = 0
coarse = np.sum(fine.reshape(-1, 2), axis=1)

def wfunc(zipped):
    return zipped[0] / zipped[1]

def wpar(zipped, processes):
    p = Pool(processes)
    calc = np.asarray(p.map(wfunc, zipped))  # map over the (fine, coarse) pairs passed in
    p.close()
    p.join()
    return calc[np.isfinite(calc)].reshape(-1, 2)
However, the suggestion by @behzad.nouri is evidently better:
def w_opt(fine, coarse):
    w = (fine / np.repeat(coarse, 2))
    return w[np.isfinite(w)].reshape(-1, 2)

# using some IPython magic
%timeit w_opt(fine,coarse)
1000 loops, best of 3: 1.88 ms per loop
%timeit w_xx(fine,coarse)
1 loops, best of 3: 342 ms per loop
%timeit wpar(zip(fine,np.repeat(coarse,2)),6) #I've 6 cores at my disposal
1 loops, best of 3: 1.76 s per loop
Thanks again!
I'm currently trying to do a Monte Carlo simulation. The problem is that it's taking quite a while to run 100,000 runs or more, when I'm told it shouldn't take very long.
Here's my code:
runs = 10000

import matplotlib.pyplot as plt
import random
import numpy as np
from scipy.stats import norm
from scipy.stats import uniform
import seaborn as sns
import pandas

def steadystate():
    p = 0.88
    Cout = 4700000000
    LambdaAER = 0.72
    Vol = 44.5
    Depo = 0.42
    Uptime = 0.1
    Effic = 0.38
    Recirc = 4.3
    x = random.randint(86900000, 2230000000000)
    conc = ((p*Cout*LambdaAER)+(x/Vol))/(LambdaAER+Depo+(Uptime*Effic*Recirc))
    return conc

x = 0
while x < runs:
    #results = steadystate (Faster)
    results = np.array([steadystate() for _ in range(1000)])
    print(results)
    x += 1

ax = sns.distplot(results,
                  bins=100,
                  kde=True,
                  color='skyblue',
                  hist_kws={"linewidth": 15, 'alpha': 1})
ax.set(xlabel='Uniform Distribution ', ylabel='Frequency')
I'm fairly new at Python, so I'm unsure of where to optimize my code. Any help or suggestions would be much appreciated.
You're not actually benefiting from numpy here, because you produce each value one at a time, doing all the math for that one value, then producing the array from the results. Work with arrays from the get-go, and do all the work on all elements in bulk to derive the benefits of vectorization:
import numpy.random

def steadystate(count):  # Receive desired number of values for bulk generation
    p = 0.88
    Cout = 4700000000
    LambdaAER = 0.72
    Vol = 44.5
    Depo = 0.42
    Uptime = 0.1
    Effic = 0.38
    Recirc = 4.3
    x = numpy.random.randint(86900000, 2230000000000, count)  # Make array of count values all at once
    # Perform all the math in bulk
    conc = ((p*Cout*LambdaAER)+(x/Vol))/(LambdaAER+Depo+(Uptime*Effic*Recirc))
    return conc

x = 0
while x < runs:
    results = steadystate(1000)  # Just call with number of desired items
    print(results)
    x += 1
Note that this code matches your original code by replacing results each time, rather than accumulating results. I'm not clear on what you want to do instead, so this is just doing the (probably) wrong thing much faster.
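If the intent was instead to keep every run, one hedged way (reusing the vectorized steadystate above) would be to accumulate, or simply draw everything at once:

# Accumulate all runs instead of overwriting `results` each pass...
all_results = np.concatenate([steadystate(1000) for _ in range(runs)])
# ...or, equivalently, draw everything in a single call:
# all_results = steadystate(1000 * runs)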
About 70% of the time you are losing goes into creating the random numbers. The question is whether you really need fresh random numbers every time; it might be sufficient to generate the random values just once and reuse them.
That said, the code is already pretty quick: excluding the drawing part, a single iteration took just 1.2 ms.
%timeit results = np.array([steadystate() for _ in range(1000)])
1.24 ms ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
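If reusing the draws is acceptable for your simulation (a judgement call that depends on what you are modelling), a minimal sketch of that idea, based on the vectorized steadystate from the previous answer, would be:

# Draw the random values once and reuse them across iterations.
x_pool = np.random.randint(86900000, 2230000000000, 1000)

p, Cout, LambdaAER, Vol = 0.88, 4700000000, 0.72, 44.5
Depo, Uptime, Effic, Recirc = 0.42, 0.1, 0.38, 4.3
conc = ((p*Cout*LambdaAER) + (x_pool/Vol)) / (LambdaAER + Depo + (Uptime*Effic*Recirc))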
I was hoping to find some clever approaches to solving a parallel-processing problem I've been struggling with. Basically, I am dealing with 20,160 multidimensional arrays with size (72,35,25,20). Currently, I'm integrating out the dimension with size 72 by simply doing a trapezoidal integration in a nested for-loop. My end goal is to get an output array with size (20160,35,25,20).
for idx, filename in enumerate(filenames):
    # Read NetCDF Data File as 'raw_data'
    flux = raw_data['FluxHydrogen'][:]   # This is size (72,35,25,20)
    PA = raw_data['PitchAngleGrid'][:]   # This is size (72)
    for i in range(35):
        for j in range(25):
            for k in range(20):
                dir_flux = flux[:,i,j,k]
                omni_flux = np.trapz(dir_flux*np.sin(PA), PA)
                data[idx,i,j,k] = omni_flux  # data will have size (20160,35,25,20)
I believe it would be most beneficial to implement the parallelization lower in the nested for-loop but can't seem to figure out how. I have searched for common questions, but none [that I have found] provide enough insight into how to implement shared memory, pass multidimensional arrays to the pools, and/or reshape the resulting array. Any help or insight would be greatly appreciated.
You can use Numba to speed up this code by a large margin. Numba is a JIT compiler that is able to compile NumPy-based code to fast native code (so loops are not an issue with it; in fact, it is a good idea to use loops in Numba).
The first thing to do is to pre-compute np.sin(PA) once so as to avoid repeated computations. Then, dir_flux * np.sin(PA) can be computed using a for loop, with the result stored in a pre-allocated array so as not to perform millions of expensive small array allocations. The outer loop can be executed using multiple threads with prange and the Numba flag parallel=True. It can be further accelerated using the flag fastmath=True, assuming the input values are not special (like NaN, Inf, or very small values: see subnormal numbers).
While this should theoretically be enough to get fast code, the current implementation of np.trapz is not efficient, as it performs expensive allocations. One can easily rewrite the function so that it does not allocate any additional arrays.
Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('(float64[::1], float64[::1])')
def trapz(y, x):
    s = 0.0
    for i in range(x.size-1):
        dx = x[i+1] - x[i]
        dy = y[i] + y[i+1]
        s += dx * dy
    return s * 0.5

@nb.njit('(float64[:,:,:,:], float64[:])', parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    flattenPA = np.ascontiguousarray(PA)
    sinPA = np.sin(flattenPA)
    for i in nb.prange(si):
        tmp = np.empty(sl)
        for j in range(sj):
            for k in range(sk):
                dir_flux = flux[:, i, j, k]
                for l in range(sl):
                    tmp[l] = dir_flux[l] * sinPA[l]
                omni_flux = trapz(tmp, flattenPA)
                data[i, j, k] = omni_flux
    return data

for idx, filename in enumerate(filenames):
    # Read NetCDF Data File as 'raw_data'
    flux = raw_data['FluxHydrogen'][:]   # This is size (72,35,25,20)
    PA = raw_data['PitchAngleGrid'][:]   # This is size (72)
    data[idx] = compute(flux, PA)
Note that flux and PA must be NumPy arrays. Also note that trapz is accurate as long as len(PA) is relatively small and np.std(PA) is not huge. Otherwise a pair-wise summation, or even a (paranoid) Kahan summation, should help (note that NumPy uses a pair-wise summation). In practice, results are the same on random normal numbers.
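For reference, a Kahan-compensated variant of the trapz kernel above might look like the following sketch (it deliberately omits fastmath, since aggressive floating-point optimizations could optimize the compensation away):

@nb.njit('(float64[::1], float64[::1])')
def trapz_kahan(y, x):
    s = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for i in range(x.size - 1):
        term = (x[i+1] - x[i]) * (y[i] + y[i+1])
        t = term - c
        tmp = s + t
        c = (tmp - s) - t
        s = tmp
    return s * 0.5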
Further optimizations
The code can be made even faster by making flux accesses more contiguous. An efficient transposition can be used to do that (NumPy's is not efficient). However, this is not simple to do on 4D arrays. Another solution is to compute the trapz operation on whole lines of the k dimension. This makes the computation very efficient and nearly memory-bound on my machine. Here is the code:
@nb.njit('(float64[:,:,:,:], float64[:])', fastmath=True, parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    sinPA = np.sin(PA)
    premultPA = PA * 0.5
    for i in nb.prange(si):
        for j in range(sj):
            dir_flux = flux[:, i, j, :]
            data[i, j, :].fill(0.0)
            for l in range(sl-1):
                dx = premultPA[l+1] - premultPA[l]
                fact1 = dx * sinPA[l]
                fact2 = dx * sinPA[l+1]
                for k in range(sk):
                    data[i, j, k] += fact1 * dir_flux[l, k] + fact2 * dir_flux[l+1, k]
    return data
Note that the premultiplication makes the computation slightly less precise.
Results
Here are results on random numbers (like @DominikStańczak used) on my 6-core machine (i5-9600KF processor):
Initial sequential solution: 193.14 ms (± 1.8 ms)
DominikStańczak sequential vectorized solution: 8.68 ms (± 48.1 µs)
Numba parallel solution without fastmath: 0.48 ms (± 6.7 µs)
Numba parallel solution with fastmath: 0.38 ms (± 9.5 µs)
Best Numba solution (with fastmath): 0.32 ms (± 5.2 µs)
Optimal lower-bound execution: 0.24 ms (RAM bandwidth saturation)
Thus, the fastest Numba version is 27 times faster than the (sequential) version of @DominikStańczak and 604 times faster than the initial one. It is nearly optimal.
As a first step, let's vectorize the code itself. I'm just going to deal with doing this on a per-file basis for now, to show you how to get rid of the nested for loop:
shape = (72, 35, 25, 20)
flux = np.random.normal(size=shape)
PA = np.random.normal(size=shape[0])
Now, timing your implementation, rewritten a little:
%%timeit
data = np.empty(shape[1:])
for i in range(shape[1]):
    for j in range(shape[2]):
        for k in range(shape[3]):
            dir_flux = flux[:,i,j,k]
            omni_flux = np.trapz(dir_flux*np.sin(PA), PA)
            data[i,j,k] = omni_flux
# 211 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My first idea was to pull sin out of the for loop, because there's no need to recalculate it each time, but that only got me about 10 ms. However, if instead of the for loop we use plain NumPy vectorization via broadcasting, turning sin_PA into a (72, 1, 1, 1)-shaped array:
%%timeit
sin_PA = np.sin(PA).reshape(-1, 1, 1, 1)
data = np.trapz(flux * sin_PA, x=PA, axis=0)
# 9.03 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's a 20 times speed up, nothing to scoff at. I estimate it'd take about three minutes for all your files. You can also use np.allclose to verify the results agree up to floating point error.
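For example, a small sketch of that check (recomputing both versions outside the %%timeit cells, since variables defined inside them don't persist):

# Recompute both versions once and compare them element-wise
sin_PA = np.sin(PA).reshape(-1, 1, 1, 1)
data_vec = np.trapz(flux * sin_PA, x=PA, axis=0)

data_loop = np.empty(shape[1:])
for i in range(shape[1]):
    for j in range(shape[2]):
        for k in range(shape[3]):
            data_loop[i, j, k] = np.trapz(flux[:, i, j, k] * np.sin(PA), PA)

print(np.allclose(data_loop, data_vec))  # expected: True, up to floating point error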
If you still need to parallelize this afterwards, I would use dask.array. In fact, if you've got your data in netCDF4 files, I would use xarray (which is helpful for multidimensional data anyway) to read them, and then run the trapz computations with Dask enabled in the backend. I think that's the simplest way to achieve easy multiprocessing in this case. Here's a quick sketch:
import xarray
from dask.distributed import Client
client = Client()

file_data = xarray.open_mfdataset(filenames, parallel=True)
# massage the data a little, probably
flux = file_data["FluxHydrogen"]
PA = file_data["PitchAngleGrid"]

integrand = flux * np.sin(PA)  # most element-wise numpy operations work on xarray or Dask-backed arrays without a hitch
data = integrand.integrate(coord="PitchAngle")  # or some such name for the dimension you're integrating out
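Presumably you would then trigger the lazy Dask computation explicitly when you need the numbers in memory; something like this (variable names follow the sketch above):

# Force evaluation of the Dask-backed result and pull it into memory
result = data.compute()
print(result.shape)  # roughly (20160, 35, 25, 20), give or take dimension ordering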
I have the following code in python:
def P(z, u0):
    x = np.inner(z, u0)
    tmp = x*u0
    return (z - tmp)

def powerA2(A, u0):
    x0 = np.random.rand(len(A))
    for i in range(ITERATIONS):
        x0 = P(np.dot(A, x0), u0)
        x0 = x0 / np.linalg.norm(x0)
    return (np.inner(np.dot(A, x0), x0))
np is the numpy package.
I am interested in running this code for matrices of size 100,000 × 100,000, but it seems there is no chance for this program to run fast (and I need to run it many times, about 10,000 times).
Is there any chance that tricks like multi-threading would work here?
Does anything else help to accelerate it?
You could consider using Pythran. Compiling the following code (norm.py):
#pythran export powerA2(float [][], float[])
import numpy as np

def P(z, u0):
    x = np.inner(z, u0)
    tmp = x*u0
    return (z - tmp)

def norm(x):
    return np.sqrt(np.sum(np.abs(x)**2))

def powerA2(A, u0):
    ITERATIONS = 100
    x0 = np.random.random(len(A))
    for i in range(ITERATIONS):
        x0 = P(np.dot(A, x0), u0)
        x0 = x0 / norm(x0)
    return (np.inner(np.dot(A, x0), x0))
with:
pythran norm.py
yields the following speedup:
$ python -m timeit -s 'import numpy as np; A = np.random.rand(100, 100); B = np.random.random(100); import norm' 'norm.powerA2(A, B)'
100 loops, best of 3: 3.1 msec per loop
$ pythran norm.py -O3 -march=native
$ python -m timeit -s 'import numpy as np; A = np.random.rand(100, 100); B = np.random.random(100); import norm' 'norm.powerA2(A, B)'
1000 loops, best of 3: 937 usec per loop
Just to check: you want to do 10^4 runs over something with 10^10 elements, so even if each operation were O(1), that's still 10^14 operations, which is a pretty hard problem (and, as haraldkl pointed out in his comment, it also eats a ton of memory). Also to check: are you going to call powerA2 10,000 times, or is 10,000 your desired value for ITERATIONS? If the former, you could use threads (or, better yet, separate processes) to get some parallelization, but I don't know if that will be enough; if the latter, unless there's a trick I'm missing, your inputs don't seem parallelizable, since the input to each loop iteration depends on the output of the previous one. There may be a way to do this on a GPU (I would like to think there'd be an efficient way to at least do the normalization bit, so that it can do large numbers of operations quickly via vectorization).
Edit in response to comment: CPython (the most common Python implementation) has a Global Interpreter Lock (GIL); some other Python implementations (Jython, IronPython) do not. Per https://wiki.python.org/moin/GlobalInterpreterLock:
Note that potentially blocking or long-running operations, such as
I/O, image processing, and NumPy number crunching, happen outside the
GIL. Therefore it is only in multithreaded programs that spend a lot
of time inside the GIL, interpreting CPython bytecode, that the GIL
becomes a bottleneck.
As far as I know, it should be possible to use threads with numpy and not be horribly bottlenecked but your problem still looks hard to convert to threads unless there's some bit of math I'm missing.
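For the "former" case above (many independent calls to powerA2 rather than one long dependent chain), a minimal thread-pool sketch might look like this. It assumes Python 3's concurrent.futures (or the futures backport on 2.7) and relies on np.dot releasing the GIL during the heavy BLAS work; powerA2, A and u0 are assumed to be defined as in the question, and the worker count and number of calls are arbitrary:

# Run independent powerA2 calls concurrently in a thread pool.
from concurrent.futures import ThreadPoolExecutor

def run_once(_):
    return powerA2(A, u0)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_once, range(100)))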
I get a 10% improvement over the uncompiled serge-sans-paille version by redefining the functions this way:
def P0(z, u0):
    x = np.inner(z, u0)
    x *= u0
    return (z - x)

def norm0(x):
    return np.sqrt(np.sum(x*x))

def powerA20(A, u0):
    ITERATIONS = 100
    x0 = np.random.random(len(A))
    for i in range(ITERATIONS):
        x0 = P0(np.dot(A, x0), u0)
        x0 /= norm0(x0)
    return (np.inner(np.dot(A, x0), x0))
Doing things like x *= u0 instead of x = x*u0 avoids unnecessary copies of the variables in RAM, speeding the program up a little bit.
Also, you don't need abs in that case. And finally, x*x is slightly faster than x**2.
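As a quick illustrative check of that last claim (IPython; exact numbers will vary by machine and NumPy version):

import numpy as np
x = np.random.random(1000000)

%timeit x**2   # power
%timeit x*x    # plain multiply; the claim above is that this is slightly faster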
I need to calculate the local statistics of an image depending on a 2D window block defined by the user. Stats include: mean, variance, skew, kurtosis. I need to traverse each pixel of the image and find the neighboring pixels depending on the window size.
The code that I used was:
scipy.ndimage.generic_filter(array,numpy.var,size=3)
But the performance of this is very low. I also tried NumPy strides, but that isn't showing much of a difference either (and I wasn't able to compute skewness or kurtosis that way). I'm not familiar with Cython, so I have not ventured into that option.
So is there any other way to accomplish this without Cython?
The reason uniform_filter() is so much faster than generic_filter() is due to Python -- for generic_filter(), Python gets called for each pixel, while for uniform_filter(), the whole image is processed in native code. (I found OpenCV's boxFilter() even faster than uniform_filter(), see my answer to a "window variance" question.)
In the remainder of this answer, I show how to do a skew calculation using uniform_filter(), which dramatically speeds up a generic_filter()-based version such as:
import scipy.ndimage as ndimage, scipy.stats as st
ndimage.generic_filter(img, st.skew, size=(1,5))
SciPy's st.skew() (see, e.g., v0.17.0) appears to calculate the skew as
m3 / m2**1.5
where m3 = E[(X-m)**3] (the third central moment), m2 = E[(X-m)**2] (the variance), and m = E[X] (the mean).
To use uniform_filter(), one has to write this in terms of raw moments such as m3p = E[X**3] and m2p = E[X**2] (a prime symbol is usually used to distinguish the raw moment from the central one):
m3 = E[(X-m)**3] = ... = m3p - 3*m*m2p + 2*m**3
m2 = E[(X-m)**2] = ... = m2p - m*m
(In case my "..." skips too much, this answer has the full derivation for m2.) Then one can implement skew() using uniform_filter() (or boxFilter() for some additional speedup):
def winSkew(img, wsize):
    imgS = img*img
    m, m2p, m3p = (ndimage.uniform_filter(x, wsize) for x in (img, imgS, imgS*img))
    mS = m*m
    return (m3p - 3*m*m2p + 2*mS*m)/(m2p - mS)**1.5
Compared to generic_filter(), winSkew() gives a 654-fold speedup on the following example on my machine:
In [185]: img = np.random.randint(0, 256, (500,500)).astype(np.float)
In [186]: %timeit ndimage.generic_filter(img, st.skew, size=(1,5))
1 loops, best of 3: 14.2 s per loop
In [188]: %timeit winSkew(img, (1,5))
10 loops, best of 3: 21.7 ms per loop
And the two calculations give essentially identical results:
In [190]: np.allclose(winSkew(img, (1,5)), ndimage.generic_filter(img, st.skew, size=(1,5)))
Out[190]: True
The code for a Kurtosis calculation can be derived the same way.
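As an illustration of that remark, here is a hedged sketch (not from the original answer) using the Fisher definition m4/m2**2 - 3 that scipy.stats.kurtosis defaults to, with the fourth central moment expanded into raw moments the same way as for the skew:

# m4 = E[(X-m)**4] = m4p - 4*m*m3p + 6*m**2*m2p - 3*m**4
def winKurtosis(img, wsize):
    imgS = img * img
    m, m2p, m3p, m4p = (ndimage.uniform_filter(x, wsize)
                        for x in (img, imgS, imgS*img, imgS*imgS))
    mS = m * m
    m2 = m2p - mS                            # variance, E[(X-m)**2]
    m4 = m4p - 4*m*m3p + 6*mS*m2p - 3*mS*mS  # fourth central moment
    return m4 / m2**2 - 3.0                  # Fisher (excess) kurtosis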
The problem is that generic_filter() cannot assume that your filter is separable along the x or y axes. Thus it must operate as a true 2D filter rather than a series of two 1D filters, so run-time will be much slower.
The mean filter is equivalent (I think) to uniform_filter(), which, if you read the documentation, is implemented as a series of two 1D uniform filters.
I compared timing via this code block:
import numpy as np
from scipy import ndimage as ndi
from scipy import misc
baboonfile = '/Users/curt/Downloads/BaboonRGB.jpg' #local download of http://read.pudn.com/downloads169/sourcecode/graph/texture_mapping/776733/Picture/BaboonRGB__.jpg
im = misc.imread(baboonfile)
meanfilt2D = ndi.generic_filter(im, np.mean, size=[3, 3, 1])
%timeit meanfilt2D = ndi.generic_filter(im, np.mean, size=[3, 3, 1])
print meanfilt2D.shape
meanfiltU = ndi.uniform_filter(im, size=[3, 3, 1])
%timeit meanfiltU = ndi.uniform_filter(im, size=[3, 3, 1])
print meanfiltU.shape
The output of that block was:
1 loops, best of 3: 5.22 s per loop
(512, 512, 3)
100 loops, best of 3: 11.8 ms per loop
(512, 512, 3)
so true two-dimensional generic_filter() takes 5 seconds for a small image but the two-pass 1D uniform_filter() takes only milliseconds. (N.B.: The difference image meanfilt2D-meanfiltU was not identically zero, but the maximum element was 2; I think the differences are caused by rounding and the imprecise datatype (uint8) used for im.)
For variance and other filters, you should see this old Stack Overflow post which answers a highly related question.
This question is probably answered somewhere, but I cannot find where, so I will ask here:
I have a set of data consisting of several samples per timestep. So, I basically have two arrays, "times", which looks something like: (0,0,0,1,1,1,1,1,2,2,3,4,4,4,4,...) and my data which is the value for each time. Each timestep has a random number of samples. I would like to get the average value of the data at each timestep in an efficient manner.
I have prepared the following sample code to show what my data looks like. Basically, I am wondering if there is a more efficient way to write the "average_values" function.
import numpy as np
import matplotlib.pyplot as plt

def average_values(x, y):
    unique_x = np.unique(x)
    averaged_y = [np.mean(y[x==ux]) for ux in unique_x]
    return unique_x, averaged_y

# generate our data
times = []
samples = []

# we have some timesteps:
for time in np.linspace(0,10,101):
    # and a random number of samples at each timestep:
    num_samples = np.random.random_integers(1,10)
    for i in range(0,num_samples):
        times.append(time)
        samples.append(np.sin(time)+np.random.random()*0.5)

times = np.array(times)
samples = np.array(samples)

plt.plot(times,samples,'bo',ms=3,mec=None,alpha=0.5)
plt.plot(*average_values(times,samples),color='r')
plt.show()
Here is what it looks like:
Generic code to do this uses np.unique with return_inverse=True, which maps every element of x back to its position in the array of unique values; np.bincount on those indices then gives the per-timestep counts, and with weights=y it gives the per-timestep sums:
def average_values_bis(x, y):
    unq_x, idx = np.unique(x, return_inverse=True)
    count_x = np.bincount(idx)
    sum_y = np.bincount(idx, weights=y)
    return unq_x, sum_y / count_x
Adding the function above, plus the following line for the plotting, to your script
plt.plot(*average_values_bis(times, samples),color='g')
produces this output, with the red line hidden behind the green one:
But timing both approaches reveals the benefits of using bincount, a 30x speed-up:
%timeit average_values(times, samples)
100 loops, best of 3: 2.83 ms per loop
%timeit average_values_bis(times, samples)
10000 loops, best of 3: 85.9 us per loop
May I propose a pandas solution? It is highly recommended if you are going to be working with time series.
Create test data
import pandas as pd
import numpy as np
times = np.random.randint(0,10,size=50)
values = np.sin(times) + np.random.random_sample((len(times),))
s = pd.Series(values, index=times)
s.plot(linestyle='.', marker='o')
Calculate averages
avs = s.groupby(level=0).mean()
avs.plot()
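If you then need plain NumPy arrays again (for example, to match the return style of average_values), a small sketch of how to pull them out of the Series:

unique_times = avs.index.values   # the unique timesteps
averaged_samples = avs.values     # the per-timestep means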