I am trying to count the number of events above various thresholds. I used a for loop over the thresholds, but there are so many events that it takes too much time.
I would like to vectorize this macro and reduce the compute time. Can I get some help?
array_ = np.zeros(bin_number, dtype=int)   # one counter per threshold
for i in range(bin_number):
    mask_1 = array_ML[:, 0] > i            # events above threshold i
    masked_array = array_ML[mask_1]
    mask_2 = masked_array[:, 2] == 0       # keep only events with label 0
    masked_array = masked_array[mask_2]
    array_[i] = masked_array.shape[0]
There may be a dedicated function in NumPy that does this for you, but otherwise, the following simplifications are likely to speed up your code significantly:
import numpy as np

# Create example data
array_ML = np.random.randint(0, 1000, (10000, 200))
array_ML[:, 2] = np.where(array_ML[:, 2] > 500, 0, 1)
bin_number = 100
array_ = np.zeros(bin_number, dtype=int)

# Filter what we can before the loop
mask = array_ML[:, 2] == 0
temp = array_ML[mask, 0]

# Just count, by summing the condition
for i in range(bin_number):
    array_[i] = np.sum(temp > i)
With the above example data, my timings (using %%time in Jupyter notebook cells) reduce from 439 ms (original code) to 3.86 ms (code above).
Of course, the timing decreases are heavily dependent on your input data shape, distribution of data, and bin_number; my timings serve as an indication.
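If you want to remove the remaining Python loop as well, here is a minimal sketch (assuming the thresholds are exactly 0, 1, ..., bin_number-1, as in the loop above): sort the filtered values once and count with np.searchsorted.
# Sort once, then count how many values exceed each threshold.
temp_sorted = np.sort(temp)
thresholds = np.arange(bin_number)
# side='right' counts values <= threshold; subtracting gives values > threshold
array_vectorized = temp_sorted.size - np.searchsorted(temp_sorted, thresholds, side='right')
This gives the same counts as the loop, with a single sort plus one vectorized search.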
Related
The question is simple: here is my current algorithm. It is terribly slow because of the loops over the arrays. Is there a way to rewrite it to avoid the loops and take advantage of NumPy array types?
import numpy as np

def loopingFunction(listOfVector1, listOfVector2):
    resultArray = []
    for vector1 in listOfVector1:
        result = 0
        for vector2 in listOfVector2:
            result += np.dot(vector1, vector2) * vector2[2]
        resultArray.append(result)
    return np.array(resultArray)
listOfVector1x = np.linspace(0,0.33,1000)
listOfVector1y = np.linspace(0.33,0.66,1000)
listOfVector1z = np.linspace(0.66,1,1000)
listOfVector1 = np.column_stack((listOfVector1x, listOfVector1y, listOfVector1z))
listOfVector2x = np.linspace(0.33,0.66,1000)
listOfVector2y = np.linspace(0.66,1,1000)
listOfVector2z = np.linspace(0, 0.33, 1000)
listOfVector2 = np.column_stack((listOfVector2x, listOfVector2y, listOfVector2z))
result = loopingFunction(listOfVector1, listOfVector2)
I have to deal with really big arrays that contain far more than 1000 vectors each, so if you have any advice, I'll take it.
The obligatory np.einsum benchmark
r2 = np.einsum('ij, kj, k->i', listOfVector1, listOfVector2, listOfVector2[:,2], optimize=['einsum_path', (1, 2), (0, 1)])
#%timeit result: 10000 loops, best of 5: 116 µs per loop
np.testing.assert_allclose(result, r2)
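For readers less familiar with einsum: the subscript string 'ij, kj, k->i' computes result_i = sum_k (v1_i · v2_k) * v2_k[2], and the supplied path contracts listOfVector2 with its own third column first. A sketch of the same computation spelled out in two steps (names are illustrative):
# What the optimized contraction path effectively does:
w = listOfVector2.T @ listOfVector2[:, 2]   # shape (3,): sum_k v2_k[2] * v2_k
r2_check = listOfVector1 @ w                # shape (n,): one matrix-vector product
np.testing.assert_allclose(r2, r2_check)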
Just for fun, I wrote an optimized Numba implementation that outperforms all the others. It is based on the einsum optimization of the @MichaelSzczesny answer.
import numpy as np
import numba as nb

# This decorator asks Numba to eagerly compile the code using
# the provided signature string (containing the parameter types).
@nb.njit('(float64[:,::1], float64[:,::1])')
def loopingFunction_numba(listOfVector1, listOfVector2):
    n, m = listOfVector1.shape
    n2 = listOfVector2.shape[0]
    assert m == 3
    result = np.empty(n)
    # Accumulate sum_k v2_k[2] * v2_k once
    s1 = s2 = s3 = 0.0
    for i in range(n2):
        factor = listOfVector2[i, 2]
        s1 += listOfVector2[i, 0] * factor
        s2 += listOfVector2[i, 1] * factor
        s3 += listOfVector2[i, 2] * factor
    # Then one 3-element dot product per vector of listOfVector1
    for i in range(n):
        result[i] = listOfVector1[i, 0] * s1 + listOfVector1[i, 1] * s2 + listOfVector1[i, 2] * s3
    return result
result = loopingFunction_numba(listOfVector1, listOfVector2)
Here are timings on my i5-9600KF processor:
Initial: 1052.0 ms
ymmx: 5.121 ms
MichaelSzczesny: 75.40 us
MechanicPig: 3.36 us
Numba: 2.74 us
Optimal lower bound: 0.66 us
This solution is ~384_000 times faster than the original one. Note that it does not even use the SIMD instructions of the processor, which would give a further ~4x speed-up on my machine. That is only possible with transposed inputs, which are much more SIMD-friendly than the current layout. Transposition may also speed up other answers like MechanicPig's, since BLAS can often benefit from it. The resulting code would reach the symbolic 1_000_000x speed-up factor!
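A hedged sketch of that transposed (structure-of-arrays) layout; the function name and the six 1-D parameters are illustrative, not part of the original answer:
# Same algorithm, but fed with one contiguous 1-D array per coordinate,
# which is the SIMD-friendly layout hinted at above.
@nb.njit('(float64[::1], float64[::1], float64[::1], float64[::1], float64[::1], float64[::1])')
def loopingFunction_numba_soa(v1x, v1y, v1z, v2x, v2y, v2z):
    s1 = s2 = s3 = 0.0
    for i in range(v2x.size):
        f = v2z[i]
        s1 += v2x[i] * f
        s2 += v2y[i] * f
        s3 += v2z[i] * f
    out = np.empty(v1x.size)
    for i in range(v1x.size):
        out[i] = v1x[i] * s1 + v1y[i] * s2 + v1z[i] * s3
    return out

# The caller prepares contiguous per-coordinate arrays once.
v1x, v1y, v1z = (np.ascontiguousarray(c) for c in listOfVector1.T)
v2x, v2y, v2z = (np.ascontiguousarray(c) for c in listOfVector2.T)
result_soa = loopingFunction_numba_soa(v1x, v1y, v1z, v2x, v2y, v2z)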
You can at least remove the two for loops to save a lot of time; use matrix computation directly:
import time
import numpy as np

def loopingFunction(listOfVector1, listOfVector2):
    resultArray = []
    for vector1 in listOfVector1:
        result = 0
        for vector2 in listOfVector2:
            result += np.dot(vector1, vector2) * vector2[2]
        resultArray.append(result)
    return np.array(resultArray)

def loopingFunction2(listOfVector1, listOfVector2):
    resultArray = np.sum(np.dot(listOfVector1, listOfVector2.T) * listOfVector2[:, 2], axis=1)
    return resultArray
listOfVector1x = np.linspace(0,0.33,1000)
listOfVector1y = np.linspace(0.33,0.66,1000)
listOfVector1z = np.linspace(0.66,1,1000)
listOfVector1 = np.column_stack((listOfVector1x, listOfVector1y, listOfVector1z))
listOfVector2x = np.linspace(0.33,0.66,1000)
listOfVector2y = np.linspace(0.66,1,1000)
listOfVector2z = np.linspace(0, 0.33, 1000)
listOfVector2 = np.column_stack((listOfVector2x, listOfVector2y, listOfVector2z))
t0 = time.time()
result = loopingFunction(listOfVector1, listOfVector2)
print('time old version', time.time() - t0)

t0 = time.time()
result2 = loopingFunction2(listOfVector1, listOfVector2)
print('time matrix computation version', time.time() - t0)

print('Are the results the same?', np.allclose(result, result2))
Which gives
time old version 1.174513578414917
time matrix computation version 0.011968612670898438
Are the results the same? True
Basically, the fewer loops, the better.
Avoid the nested loops and adjust the calculation order: since result_i = sum_k (v1_i · v2_k) * v2_k[2] = v1_i · (sum_k v2_k[2] * v2_k), the weighted sum over listOfVector2 can be computed once. This is about 20 times faster than the optimized np.einsum and nearly 400_000 times faster than the original program:
>>> out = listOfVector1.dot(listOfVector2[:, 2].dot(listOfVector2))
>>> np.allclose(out, loopingFunction(listOfVector1, listOfVector2))
True
Test:
>>> from timeit import timeit
>>> timeit(lambda: loopingFunction(listOfVector1, listOfVector2), number=1)
1.4389081999834161
>>> timeit(lambda: listOfVector1.dot(listOfVector2[:, 2].dot(listOfVector2)), number=400_000)
1.3162514999858104
>>> timeit(lambda: np.einsum('ij, kj, k->i', listOfVector1, listOfVector2, listOfVector2[:, 2], optimize=['einsum_path', (1, 2), (0, 1)]), number=18_000)
1.3501517999975476
I'd like to sample n random numbers from a linspace without replacement and do so in batches. Thus, each sample in the batch should not have repeated numbers, but numbers may repeat across the batch.
The following code shows how I do it by calling Generator.choice repeatedly.
import numpy as np

low, high = 0, 10
sample_shape = (3,)
n = 5

rng = np.random.default_rng()  # or previously instantiated RNG
space = np.linspace(start=low, stop=high, num=1000)
samples = np.stack(
    [
        rng.choice(space, size=n, replace=False)
        for _ in range(np.prod(sample_shape, dtype=int))
    ]
)
samples = samples.reshape(sample_shape + (n,))

print(f"samples.shape: {samples.shape}")
print(samples)
Current output:
samples.shape: (3, 5)
[[4.15415415 5.56556557 1.38138138 7.78778779 7.03703704]
[1.48148148 6.996997 0.91091091 3.28328328 2.93293293]
[7.82782783 9.65965966 9.94994995 5.84584585 5.26526527]]
However, this procedure turns out to be a big bottleneck in my code. Is there a more efficient way of performing this?
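One common alternative (a hedged sketch, not an official Generator API for this): draw one uniform key per candidate per batch element and keep the indices of the n smallest keys in each row. Within a row the picks are distinct; the order of the n picks is arbitrary, which is usually fine for sampling without replacement. The names below are illustrative.
# Vectorized sampling without replacement within each row.
batch = int(np.prod(sample_shape))
keys = rng.random((batch, space.size))            # one key per candidate per row
idx = np.argpartition(keys, n, axis=1)[:, :n]     # n distinct indices per row
samples_vec = space[idx].reshape(sample_shape + (n,))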
I have three 4-dimensional arrays that need to be binned to create a multi-dimensional histogram. In the example below I have used NumPy, but the actual arrays are read in from a NetCDF file using xarray.
I know that xarray uses dask in the backend, and I have tried creating a small dask cluster on the machine I am using, which has 20 cores, but I don't get any speedup in the for loops; I only get a speedup in the digitize step.
I am hoping someone can help me parallelize the for loop with dask.
import numpy as np
# Initial datasets
s = np.random.rand(5,2,3,4)
ws = np.random.rand(5,2,3,4)
wd = np.random.rand(5, 2, 3, 4)
# Digitize to different bins
s_map = np.digitize(s, [0, .5, 1])
ws_map = np.digitize(ws, [0, .25, .5, .75, 1])
wd_map = np.digitize(wd, [.25, .5, 1])
# Get indexes that have values
s_ids = np.unique(s_map)
ws_ids = np.unique(ws_map)
wd_ids = np.unique(wd_map)
# Create output array
count = np.zeros((s_ids.size, ws_ids.size, wd_ids.size) + s.shape[1:])
# Loop over each of the maps to count how many values fall into each bin
for i, s_id in enumerate(s_ids):
    s_mask = s_map == s_id
    for j, ws_id in enumerate(ws_ids):
        ws_mask = s_mask & (ws_map == ws_id)
        for k, wd_id in enumerate(wd_ids):
            mask = ws_mask & (wd_map == wd_id)
            count[i, j, k, ...] += np.count_nonzero(mask, axis=0)
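Not a dask answer, but here is a hedged sketch of a fully vectorized NumPy alternative that removes the Python loops entirely (the intermediate names are illustrative): encode the three bin ids into one flat code, then count occurrences along the first axis.
ni, nj, nk = s_ids.size, ws_ids.size, wd_ids.size
i_idx = np.searchsorted(s_ids, s_map)          # position of each value in the sorted unique ids
j_idx = np.searchsorted(ws_ids, ws_map)
k_idx = np.searchsorted(wd_ids, wd_map)
code = (i_idx * nj + j_idx) * nk + k_idx       # one flat bin id per element
flat = code.reshape(code.shape[0], -1)         # (first axis, flattened spatial dims)
count_vec = np.zeros((ni * nj * nk, flat.shape[1]))
np.add.at(count_vec, (flat, np.arange(flat.shape[1])), 1)
count_vec = count_vec.reshape((ni, nj, nk) + s.shape[1:])
# Sanity check against the loop result: np.allclose(count, count_vec)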
In Python there is the distance_transform_edt function in the scipy.ndimage.morphology module. I applied it to a simple case, to compute the distance from a single cell in a masked NumPy array.
However, the function discards the mask of the array and computes, as expected, the Euclidean distance of each non-null cell from the reference (null) cell.
Below is an example I gave in my blog post:
%pylab
from scipy.ndimage.morphology import distance_transform_edt
l = 100
x, y = np.indices((l, l))
center1 = (50, 20)
center2 = (28, 24)
center3 = (30, 50)
center4 = (60,48)
radius1, radius2, radius3, radius4 = 15, 12, 19, 12
circle1 = (x - center1[0])**2 + (y - center1[1])**2 < radius1**2
circle2 = (x - center2[0])**2 + (y - center2[1])**2 < radius2**2
circle3 = (x - center3[0])**2 + (y - center3[1])**2 < radius3**2
circle4 = (x - center4[0])**2 + (y - center4[1])**2 < radius4**2
# 4 circles
img = circle1 + circle2 + circle3 + circle4
mask = ~img.astype(bool)
img = img.astype(float)
m = ones_like(img)
m[center1] = 0
#imshow(distance_transform_edt(m), interpolation='nearest')
m = ma.masked_array(distance_transform_edt(m), mask)
imshow(m, interpolation='nearest')
However, I want to compute a geodesic distance transform that takes the masked elements of the array into account: I do not want the distance to be computed along a straight line that goes through masked elements.
I used Dijkstra's algorithm to obtain the result I want. Below is the implementation I propose:
def geodesic_distance_transform(m):
    mask = m.mask
    visit_mask = mask.copy()  # mask visited cells
    m = m.filled(numpy.inf)
    m[m != 0] = numpy.inf
    distance_increments = numpy.asarray([sqrt(2), 1., sqrt(2), 1., 1., sqrt(2), 1., sqrt(2)])
    connectivity = [(i, j) for i in [-1, 0, 1] for j in [-1, 0, 1] if (not (i == j == 0))]
    cc = unravel_index(m.argmin(), m.shape)  # current cell
    while (~visit_mask).sum() > 0:
        neighbors = [tuple(e) for e in asarray(cc) - connectivity
                     if not visit_mask[tuple(e)]]
        tentative_distance = [distance_increments[i] for i, e in enumerate(asarray(cc) - connectivity)
                              if not visit_mask[tuple(e)]]
        for i, e in enumerate(neighbors):
            d = tentative_distance[i] + m[cc]
            if d < m[e]:
                m[e] = d
        visit_mask[cc] = True
        m_mask = ma.masked_array(m, visit_mask)
        cc = unravel_index(m_mask.argmin(), m.shape)
    return m
gdt = geodesic_distance_transform(m)
imshow(gdt, interpolation='nearest')
colorbar()
The function implemented above works well, but it is too slow for the application I am developing, which needs to compute the geodesic distance transform several times.
Below is a timing comparison of the Euclidean distance transform and the geodesic distance transform:
%timeit distance_transform_edt(m)
1000 loops, best of 3: 1.07 ms per loop
%timeit geodesic_distance_transform(m)
1 loops, best of 3: 702 ms per loop
How can I obtain a faster geodesic distance transform?
First of all, thumbs up for a very clear and well-written question.
There is a very good and fast implementation of a Fast Marching method called scikit-fmm to solve this kind of problem. You can find the details here:
http://pythonhosted.org//scikit-fmm/
Installing it might be the hardest part, but on Windows with Conda it's easy, since there is a 64-bit Conda package for Py27:
https://binstar.org/jmargeta/scikit-fmm
From there on, just pass your masked array to it, as you do with your own function. Like:
import skfmm
distance = skfmm.distance(m)
The results look similar, and I think even slightly better. Your approach searches (apparently) in eight distinct directions, resulting in a somewhat 'octagonal-shaped' distance.
On my machine the scikit-fmm implementation is over 200x faster than your function.
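For completeness, a hedged end-to-end sketch with the arrays from your example; the -1/+1 setup is my assumption, not part of the original answer. Marking the seed cell negative guarantees a zero contour around it, so the reported distances are measured from that contour (offset by roughly half a cell compared with your Dijkstra-from-the-seed version).
import numpy as np
import skfmm

phi = np.ma.masked_array(np.ones_like(img), mask)  # masked cells are impassable
phi[center1] = -1                                  # zero level set surrounds the seed cell
distance = skfmm.distance(phi)                     # geodesic distance respecting the mask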
64-bit Windows binaries for scikit-fmm are now available from Christoph Gohlke.
http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-fmm
A slightly faster (about 10x) implementation that achieves the same result as your geodesic_distance_transform:
import numpy
from functools import reduce
from scipy.ndimage import distance_transform_edt

def getMissingMask(slab):
    nan_mask = numpy.where(numpy.isnan(slab), 1, 0)
    if not hasattr(slab, 'mask'):
        mask_mask = numpy.zeros(slab.shape)
    else:
        if slab.mask.size == 1 and slab.mask == False:
            mask_mask = numpy.zeros(slab.shape)
        else:
            mask_mask = numpy.where(slab.mask, 1, 0)
    mask = numpy.where(mask_mask + nan_mask > 0, 1, 0)
    return mask

def geodesic(img, seed):
    seedy, seedx = seed
    mask = getMissingMask(img)

    #----Call distance_transform_edt if no missing----
    if mask.sum() == 0:
        slab = numpy.ones(img.shape)
        slab[seedy, seedx] = 0
        return distance_transform_edt(slab)

    target = (1 - mask).sum()
    dist = numpy.ones(img.shape) * numpy.inf
    dist[seedy, seedx] = 0

    def expandDir(img, direction):
        # Shift the array by one cell in the given direction,
        # replicating the boundary row/column that would wrap around.
        if direction == 'n':
            l1 = img[0, :]
            img = numpy.roll(img, 1, axis=0)
            img[0, :] = l1
        elif direction == 's':
            l1 = img[-1, :]
            img = numpy.roll(img, -1, axis=0)
            img[-1, :] = l1
        elif direction == 'e':
            l1 = img[:, 0]
            img = numpy.roll(img, 1, axis=1)
            img[:, 0] = l1
        elif direction == 'w':
            l1 = img[:, -1]
            img = numpy.roll(img, -1, axis=1)
            img[:, -1] = l1
        elif direction == 'ne':
            img = expandDir(img, 'n')
            img = expandDir(img, 'e')
        elif direction == 'nw':
            img = expandDir(img, 'n')
            img = expandDir(img, 'w')
        elif direction == 'sw':
            img = expandDir(img, 's')
            img = expandDir(img, 'w')
        elif direction == 'se':
            img = expandDir(img, 's')
            img = expandDir(img, 'e')
        return img

    def expandIter(img):
        # Relax the distances from all 8 neighbours at once.
        sqrt2 = numpy.sqrt(2)
        tmps = []
        for dirii, dd in zip(['n', 's', 'e', 'w', 'ne', 'nw', 'sw', 'se'],
                             [1, ] * 4 + [sqrt2, ] * 4):
            tmpii = expandDir(img, dirii) + dd
            tmpii = numpy.minimum(tmpii, img)
            tmps.append(tmpii)
        img = reduce(lambda x, y: numpy.minimum(x, y), tmps)
        return img

    #----------------Iteratively expand----------------
    dist_old = dist
    while True:
        expand = expandIter(dist)
        dist = numpy.where(mask, dist, expand)
        nc = dist.size - len(numpy.where(dist == numpy.inf)[0])
        if nc >= target or numpy.all(dist_old == dist):
            break
        dist_old = dist

    return dist
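If I read your example right, it can be called directly with the masked array and the seed from your question:
gdt2 = geodesic(m, center1)   # center1 = (50, 20) is the seed cell
imshow(gdt2, interpolation='nearest')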
Also note that if the mask forms more than one connected region (e.g. adding another circle that does not touch the others), your function will fall into an endless loop.
UPDATE:
I found a Cython implementation of the Fast Sweeping method in this notebook, which can be used to achieve the same result as scikit-fmm, probably with comparable speed. One just needs to feed a binary flag matrix (with 1s as viable points, inf otherwise) as the cost to the GDT() function.
I've read here that matplotlib is good at handling large data sets. I'm writing a data processing application and have embedded matplotlib plots into wx and have found matplotlib to be TERRIBLE at handling large amounts of data, both in terms of speed and in terms of memory. Does anyone know a way to speed up (reduce memory footprint of) matplotlib other than downsampling your inputs?
To illustrate how bad matplotlib is with memory, consider this code:
import pylab
import numpy
a = numpy.arange(int(1e7))  # 10,000,000 integers (roughly 40-80 MB in memory, depending on dtype)
# watch your system memory now...
pylab.plot(a) # this uses over 230 ADDITIONAL Mb of memory
Downsampling is a good solution here: plotting 10M points consumes a lot of memory and time in matplotlib. If you know how much memory is acceptable, you can downsample based on that amount. For example, say 1M points takes 23 additional MB of memory and you find that acceptable in terms of space and time; then you should downsample so that the input always stays below 1M points:
import scipy.signal

max_points = int(1e6)
if len(a) > max_points:
    a = scipy.signal.decimate(a, len(a) // max_points + 1)
pylab.plot(a)
Or something like the above snippet (it may downsample more aggressively than you'd like).
I'm often interested in the extreme values too, so before plotting large chunks of data I proceed this way:
import numpy as np

s = np.random.normal(size=int(1e7))
decimation_factor = 10
s = np.max(s.reshape(-1, decimation_factor), axis=1)

# To check the final size
s.shape
Of course np.max is just an example of an extreme-value function.
P.S.
With numpy "strides tricks" it should be possible to avoid copying data around during reshape.
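A hedged sketch of that "strides tricks" idea (note that reshaping a contiguous array is already copy-free, so this is mostly illustrative); s_full stands for the original, full-length signal:
from numpy.lib.stride_tricks import as_strided

s_full = np.random.normal(size=int(1e7))   # original, contiguous signal
blocks = as_strided(
    s_full,
    shape=(s_full.size // decimation_factor, decimation_factor),
    strides=(s_full.strides[0] * decimation_factor, s_full.strides[0]),
)
s_max = blocks.max(axis=1)                 # same block maximum as the reshape above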
I was interested in preserving one side of a log-sampled plot, so I came up with the following (downsample being my first attempt):
import numpy as np
import matplotlib.pyplot as plt

def downsample(x, y, target_length=1000, preserve_ends=0):
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    data = np.vstack((x, y))
    if preserve_ends > 0:
        l, data, r = np.split(data, (preserve_ends, -preserve_ends), axis=1)
    interval = int(data.shape[1] / target_length) + 1
    data = data[:, ::interval]
    if preserve_ends > 0:
        data = np.concatenate([l, data, r], axis=1)
    return data[0, :], data[1, :]

def geom_ind(stop, num=50):
    geo_num = num
    # stop - 1 keeps the largest generated index within bounds
    ind = np.geomspace(1, stop - 1, dtype=int, num=geo_num)
    while len(set(ind)) < num - 1:
        geo_num += 1
        ind = np.geomspace(1, stop - 1, dtype=int, num=geo_num)
    return np.sort(list(set(ind) | {0}))

def log_downsample(x, y, target_length=1000, flip=False):
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    data = np.vstack((x, y))
    if flip:
        data = np.fliplr(data)
    data = data[:, geom_ind(data.shape[1], num=target_length)]
    if flip:
        data = np.fliplr(data)
    return data[0, :], data[1, :]
which allowed me to better preserve one side of the plot:
newx, newy = downsample(x, y, target_length=1000, preserve_ends=50)
newlogx, newlogy = log_downsample(x, y, target_length=1000)
f = plt.figure()
plt.gca().set_yscale("log")
plt.step(x, y, label="original")
plt.step(newx, newy, label="downsample")
plt.step(newlogx, newlogy, label="log_downsample")
plt.legend()