I want to compute a moving percentile over the first axis of a 4D array. At first I created the rolling windows by stacking rolled arrays along a new dimension, but later found that views via np.lib.stride_tricks.as_strided() are much faster (by more than an order of magnitude). Afterwards, I compute the percentile over the new axis. To get truncated windows near the edges while keeping the same window length, I pad the array with NaNs before obtaining the rolling windows and then apply np.nanpercentile(). However, I noticed that np.nanpercentile() is very slow, so I tested it with the example code below.
Code:
import numpy as np
import time
t = []
np.random.seed(100)
a = np.random.rand(26,72,73,273)
p = 0.7
for i in range(3):
    t.append(time.time())
    res_perc = np.percentile(a,p*100,axis=-1)
    t.append(time.time())
    res_nanperc = np.nanpercentile(a,p*100,axis=-1)
    t.append(time.time())
    print('compute percentile:', t[-2]-t[-3])
    print('compute nanpercentile:', t[-1]-t[-2])
Output:
compute percentile: 1.0262951850891113
compute nanpercentile: 24.713390827178955
compute percentile: 0.9878458976745605
compute nanpercentile: 24.505929470062256
compute percentile: 0.984771728515625
compute nanpercentile: 24.510523080825806
Questions: Why is np.nanpercentile() so much slower (by a factor of more than 20) than np.percentile(), even when the array contains no NaN values at all? Does checking for NaN values really take that long, and if so, why? The reason I am asking is that in my actual application, only 2 of the 26 rolling windows along the first axis (those at the edges) will actually contain NaNs, representing truncated windows. If I can save a factor of more than 20 for the remaining 24 rolling windows, I might just apply np.percentile() to those and np.nanpercentile() to the outer two rolling windows, then stitch the resulting values back together, significantly speeding up my code.
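The split-and-stitch idea described in the question could be sketched roughly as follows. The function name percentile_split and the array layout (windows along the first axis, window samples along the last axis) are assumptions made for illustration, not something taken from the original code.

```python
import numpy as np

def percentile_split(windows, q):
    """Percentile over the last axis, calling nanpercentile only where needed.

    `windows` is assumed to have shape (n_windows, ..., window_length).
    The name and layout are illustrative assumptions.
    """
    # Flag every window (index along the first axis) that contains a NaN.
    other_axes = tuple(range(1, windows.ndim))
    has_nan = np.isnan(windows).any(axis=other_axes)

    out = np.empty(windows.shape[:-1], dtype=float)
    if (~has_nan).any():
        # NaN-free windows take the fast path.
        out[~has_nan] = np.percentile(windows[~has_nan], q, axis=-1)
    if has_nan.any():
        # Only the windows that contain NaNs pay for nanpercentile.
        out[has_nan] = np.nanpercentile(windows[has_nan], q, axis=-1)
    return out

rng = np.random.default_rng(100)
demo = rng.random((6, 4, 5, 9))
demo[0, ..., :2] = np.nan    # truncated window at the leading edge
demo[-1, ..., -2:] = np.nan  # truncated window at the trailing edge
result = percentile_split(demo, 70)
```

Whether this pays off depends on how expensive the extra np.isnan pass and the boolean indexing (which copies the selected windows) are compared with the time saved, so it is worth timing on the real data.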
Background: I have millions of points in 2D space with (x_position, y_position, value) associated with each point. I am trying to summarize these points by creating an image, where each pixel can contain multiple points. To summarize, each pixel stores the sum of values at that (x_pixel, y_pixel) location in the image.
Question: How can I do this efficiently? Currently, my code does something like this:
image = np.zeros((4096,4096))
for point in data:
    x_pixel, y_pixel = convertPointPos2PixelPos(point)
    image[x_pixel, y_pixel] += point.getValue()
but the ETA for this code completing is 450 hours, which is unacceptable. Is there a way to parallelize this? The code is writing to the same image[x,y] index multiple times. I found StackOverflow posts that suggest using multiprocessing, but I think needing to lock to prevent race conditions will mean this will take just as much time as it would without parallelizing.
Assuming you want something on a regular grid, you can use simple division to bin your data. Here is an example:
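import numpy as np

size = (4096, 4096)
data = np.random.rand(100000000, 3)
image = np.zeros(size)
coords = data[:, :2]
# bounding box of the coordinates (named lo/hi to avoid shadowing the built-in min/max)
lo = coords.min(0)
hi = coords.max(0)
# integer pixel index of every point; the result is written directly into an int array
index = np.floor_divide(coords - lo, (hi - lo) / np.subtract(size, 1), out=np.empty(coords.shape, dtype=int), casting='unsafe')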
index is now an array of indices into image where you want to add the corresponding values. You can do an unbuffered add using np.add.at:
np.add.at(image, tuple(index.T), data[:, -1])
If your data range is better defined than just the bounding box of the coordinates, you can save a little time by not computing coords.max() and coords.min().
The result is something like this: [image: the binned result displayed with plt.imshow and a colorbar]
This entire operation takes 6.4sec on my very moderately powered machine for 10M points, including the call to plt.imshow, plt.colorbar and garbage collection before runs.
Timing collected using the %%timeit cell magic in IPython.
Either way, you're well under 450 hours. Even if your coordinate transformation is not linear binning, I expect that you can run in reasonable time as long as you vectorize it properly. Also, multiprocessing is not likely to give you a huge boost since it requires copying data around.
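As a possible alternative to np.add.at, the same accumulation can be written with np.bincount on flattened indices, which is often reported to be faster, particularly on older NumPy versions. The sketch below is an assumption-laden illustration rather than part of the answer above: the helper name bin_points and the small demo sizes are made up, and it expects every coordinate column to span a non-zero range.

```python
import numpy as np

def bin_points(points, size):
    """Sum the third column of `points` into a 2D grid of shape `size`.

    `points` is assumed to be an (n, 3) array of (x, y, value) rows.
    """
    coords = points[:, :2]
    lo = coords.min(axis=0)
    hi = coords.max(axis=0)
    # Linear binning: map the bounding box onto the pixel index range.
    cell = (hi - lo) / np.subtract(size, 1)
    index = np.floor((coords - lo) / cell).astype(np.intp)
    # Guard against floating-point round-off pushing an index out of range.
    index = np.clip(index, 0, np.subtract(size, 1))
    # Flatten the 2D indices and let bincount do the weighted accumulation.
    flat = np.ravel_multi_index((index[:, 0], index[:, 1]), size)
    sums = np.bincount(flat, weights=points[:, 2], minlength=size[0] * size[1])
    return sums.reshape(size), index

rng = np.random.default_rng(0)
points = rng.random((100_000, 3))
image, index = bin_points(points, (64, 64))
```

Repeated indices are handled correctly because np.bincount sums all weights that fall into the same bin, which is the same behaviour np.add.at provides.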
I'm trying to write a section of code that computes the curl of a vector field numerically to second order with periodic boundary conditions. However, the algorithm I made is very slow and I'm wondering if anyone knows of any alternative algorithms.
To give more specific context: I'm using a 3xAxBxC numpy array as my vector field, where the first axis refers to the Cartesian direction (x,y,z) and A,B,C refer to the number of bins in each Cartesian direction (i.e. the resolution). So for example, I might have a vector field F = np.zeros((3,64,64,64)) where Fx = F[0] is a 64x64x64 Cartesian lattice in its own right. So far, my solution has been to use the 3-point centered difference stencil to calculate the derivatives, with a nested loop iterating over all the dimensions and modular arithmetic enforcing the periodic boundary conditions (see the example below). However, as my resolution (the size of A,B,C) increases, this begins to take a long time (upwards of 2 minutes, which adds up if I do this several hundred times in my simulation; this is just one small part of a larger algorithm). I was wondering if anyone knows of an alternative method for doing this?
import numpy as np
F =np.array([np.ones([128,128,128]),2*np.ones([128,128,128]),
             3*np.ones([128,128,128])])
VxF =np.array([np.zeros([128,128,128]),np.zeros([128,128,128]),
               np.zeros([128,128,128])])
for i in range(0,128):
    for j in range(0,128):
        for k in range(0,128):
            VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k]-
                F[2][i,j-1,k])-(F[1][i,j,(k+1)%128]-F[1][i,j,k-1]))
            VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128]-
                F[0][i,j,k-1])-(F[2][(i+1)%128,j,k]-F[2][i-1,j,k]))
            VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k]-
                F[1][i-1,j,k])-(F[0][i,(j+1)%128,k]-F[0][i,j-1,k]))
Just to reiterate, I'm looking for an algorithm that computes the curl of a vector field array to second order with periodic boundary conditions, and does so faster than the one I have. Maybe there's nothing that will do this, but I just want to check before I keep spending time running this algorithm. Thank you everyone in advance!
There may be better tools for this, but here is a trivial 200x speedup with numba:
import numpy as np
from numba import jit
def pure_python():
    F =np.array([np.ones([128,128,128]),2*np.ones([128,128,128]),
                 3*np.ones([128,128,128])])
    VxF =np.array([np.zeros([128,128,128]),np.zeros([128,128,128]),
                   np.zeros([128,128,128])])
    for i in range(0,128):
        for j in range(0,128):
            for k in range(0,128):
                VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k]-
                    F[2][i,j-1,k])-(F[1][i,j,(k+1)%128]-F[1][i,j,k-1]))
                VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128]-
                    F[0][i,j,k-1])-(F[2][(i+1)%128,j,k]-F[2][i-1,j,k]))
                VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k]-
                    F[1][i-1,j,k])-(F[0][i,(j+1)%128,k]-F[0][i,j-1,k]))
    return VxF
@jit(fastmath=True)
def with_numba():
    F =np.array([np.ones([128,128,128]),2*np.ones([128,128,128]),
                 3*np.ones([128,128,128])])
    VxF =np.array([np.zeros([128,128,128]),np.zeros([128,128,128]),
                   np.zeros([128,128,128])])
    for i in range(0,128):
        for j in range(0,128):
            for k in range(0,128):
                VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k]-
                    F[2][i,j-1,k])-(F[1][i,j,(k+1)%128]-F[1][i,j,k-1]))
                VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128]-
                    F[0][i,j,k-1])-(F[2][(i+1)%128,j,k]-F[2][i-1,j,k]))
                VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k]-
                    F[1][i-1,j,k])-(F[0][i,(j+1)%128,k]-F[0][i,j-1,k]))
    return VxF
The pure Python version takes 13 seconds on my machine, while the numba version takes 65 ms.
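For comparison, the same stencil can also be expressed without explicit loops using np.roll, which wraps around the array edges and so gives the periodic boundary conditions for free. This is only a sketch under stated assumptions: the function name curl_periodic is made up, unit grid spacing is assumed (as in the loops above), and no timing is claimed for it.

```python
import numpy as np

def curl_periodic(F):
    """Second-order centered-difference curl with periodic wrap-around.

    F has shape (3, A, B, C); unit grid spacing is assumed, matching the
    0.5 * (f[i+1] - f[i-1]) stencil used in the loops above.
    """
    def d(f, axis):
        # np.roll(f, -1) places f[i+1] at position i, np.roll(f, 1) places f[i-1] there.
        return 0.5 * (np.roll(f, -1, axis=axis) - np.roll(f, 1, axis=axis))

    curl = np.empty_like(F, dtype=float)
    curl[0] = d(F[2], 1) - d(F[1], 2)
    curl[1] = d(F[0], 2) - d(F[2], 0)
    curl[2] = d(F[1], 0) - d(F[0], 1)
    return curl

rng = np.random.default_rng(0)
F_demo = rng.random((3, 6, 5, 4))
curl_demo = curl_periodic(F_demo)
```

Each np.roll call allocates a temporary copy of one component, so memory traffic is higher than in a compiled loop; for large grids the numba version may still be the faster of the two.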
I have mask images of size (N, 256, 256), where N is between 1,000 and 10,000.
Each pixel has an integer value between 0 and 2 (0 is just background).
Unfortunately, the mask image is not encoded as (N,256,256,2).
I have a few thousand of these masks. My goal is to find the quickest way to count the pixels per frame for each label (1 and 2).
Running the code below with numpy on one mask image with roughly 6000 frames takes < 2s.
np.sum(ma==1,axis=(1,2))
np.sum(ma==2,axis=(1,2))
I expect it will take a few hours to run on the entire dataset with a single process, and maybe less than an hour with multiprocessing (CPU).
I'm curious whether I can make it even faster with a GPU. Summing a tensor over axes seems easy to implement, but I can't find how to implement the ma==1 part in TensorFlow.
I thought about first converting the input to the encoded shape (N,256,256,2) and passing it to the tensor placeholder, but realized that building an array with that shape would take even longer than the code above.
Or is there a better way to implement pixel counting on this mask data using TensorFlow?
Think about what's going on in the background
Roughly the following steps are done twice in your original implementation:
Loading the whole array from memory and checking whether each value equals the desired value
Writing the results back to memory (the temporary array is as large as your input array, np.uint8 assumed)
Loading the whole temporary array into memory and summing up the results
Writing the results back to memory
It should be clear that this is quite a suboptimal implementation, parallelized or not. I could not do it any better in a purely vectorized numpy way, but there are tools available (Numba, Cython) with which you can implement this task in a more direct and parallelized way.
Example
import numpy as np
import numba as nb
import time

#Create some data
N=10000
images=np.random.randint(0, high=3, size=(N,256,256), dtype=np.uint8)

def sum_orig(ma):
    A=np.sum(ma==1,axis=(1,2))
    B=np.sum(ma==2,axis=(1,2))
    return A,B

@nb.njit(fastmath=True,parallel=True)
def sum_mod(ma):
    A=np.zeros(ma.shape[0],dtype=np.uint32)
    B=np.zeros(ma.shape[0],dtype=np.uint32)
    #parallel loop
    for i in nb.prange(ma.shape[0]):
        AT=0
        BT=0
        for j in range(ma.shape[1]):
            for k in range(ma.shape[2]):
                if (ma[i,j,k]==1):
                    AT+=1
                if (ma[i,j,k]==2):
                    BT+=1
        A[i]=AT
        B[i]=BT
    return A,B

#Warm up
#The function is compiled at the first call
[A,B]=sum_mod(images)
t1=time.time()
[A,B]=sum_mod(images)
print(time.time()-t1)
t1=time.time()
[A_,B_]=sum_orig(images)
print(time.time()-t1)
#check if it works correctly
print(np.allclose(A,A_))
print(np.allclose(B,B_))
Performance
improved_version: 0.06s
original_version: 2.07s
speedup: 33x
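If adding Numba as a dependency is not an option, a small pure-NumPy variation is to count with np.count_nonzero instead of np.sum. It still builds the temporary boolean array described above, so it is not expected to come close to the compiled loop, and any gain over np.sum should be measured rather than assumed. The helper name count_labels and the demo sizes are illustrative assumptions.

```python
import numpy as np

def count_labels(ma, labels=(1, 2)):
    """Return an array of shape (len(labels), N) with per-frame pixel counts."""
    # Counting True values directly avoids summing booleans as integers.
    return np.stack([np.count_nonzero(ma == label, axis=(1, 2)) for label in labels])

rng = np.random.default_rng(0)
masks = rng.integers(0, 3, size=(50, 16, 16), dtype=np.uint8)
counts = count_labels(masks)
```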
I'm trying to implement the Minimum Distance Algorithm for image classification using GDAL and Python. After calculating the mean pixel-value of the sample areas and storing them into a list of arrays ("sample_array"), I read the image into an array called "values". With the following code I loop through this array:
values = valBD.ReadAsArray()
# loop through pixel columns
for X in range(0,XSize):
    # loop through pixel lines
    for Y in range (0, YSize):
        # initialize variables
        minDist = 9999
        # get minimum distance
        for iSample in range (0, sample_count):
            # dist = calc_distance(values[jPixel, iPixel], sample_array[iSample])
            # computing minimum distance
            iPixelVal = values[Y, X]
            mean = sample_array[iSample]
            dist = math.sqrt((iPixelVal - mean) * (iPixelVal - mean)) # only for testing
            if dist < minDist:
                minDist = dist
                values[Y, X] = iSample
classBD.WriteArray(values, xoff=0, yoff=0)
This procedure takes a very long time for big images. That's why I want to ask if somebody knows a faster method. I don't know much about the access speed of different variables in Python. Or maybe someone knows a library I could use.
Thanks in advance,
Mario
You should definitely be using NumPy. I work with some pretty large raster datasets and NumPy burns through them. On my machine, with the code below there's no noticeable delay for a 1000 x 1000 array. An explanation of how this works follows the code.
import numpy as np
from scipy.spatial.distance import cdist
# some starter data
dim = (1000,1000)
values = np.random.randint(0, 10, dim)
# cdist will want 'samples' as a 2-d array
samples = np.array([1, 2, 3]).reshape(-1, 1)
# this could be a one-liner
# 'values' must have the same number of columns as 'samples'
mins = cdist(values.reshape(-1, 1), samples)
outvalues = mins.argmin(axis=1).reshape(dim)
cdist() calculates the "distance" from each element in values to each of the elements in samples. This generates a 1,000,000 x 3 array, where each row n holds the distances from pixel n in the original array to each of the sample values [1, 2, 3]. argmin(axis=1) gives you the index of the minimum value along each row, which is what you want. A quick reshape gives you the rectangular format you'd expect for an image.
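For the single-band case shown here, the same result can also be sketched with plain NumPy broadcasting, which avoids the SciPy dependency. The function name classify_min_distance and the tiny demo array are assumptions for illustration. Note that the intermediate distance array has one entry per pixel per sample, so memory use grows with the number of samples.

```python
import numpy as np

def classify_min_distance(values, sample_means):
    """Return, per pixel, the index of the closest sample mean (single band)."""
    sample_means = np.asarray(sample_means, dtype=float)
    # Broadcasting: (rows, cols, 1) - (n_samples,) -> (rows, cols, n_samples).
    dist = np.abs(values[..., np.newaxis] - sample_means)
    # argmin returns the first index on ties, as a strict `<` comparison would.
    return dist.argmin(axis=-1)

values = np.array([[0, 1, 4],
                   [9, 3, 2]])
classes = classify_min_distance(values, [1, 2, 3])
```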
Agree with Thomas K: use PIL, or else write a C function and wrap it using e.g. ctypes, or at the very least use some NumPy matrix operations.
Or else run your existing code under PyPy (JIT-compiled code can be 100x faster on image code). Try PyPy and tell us what speedup you got.
Bottom line: never do stuff pixel-wise like this natively in CPython; the interpreter and memory-management overhead will kill you.