I want to summarize a 3d array dat using indices contained in a 2d array idx.
Consider the example below. For each margin along dat[:, :, i], I want to compute the median according to some index idx. The desired output (out) is a 2d array, whose rows record the index and columns record the margin. The following code works but is not very efficient. Any suggestions?
import numpy as np
dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])
out = np.empty((3, 3))
for i in np.unique(idx):
out[i,] = np.median(dat[idx==i], axis = 0)
print(out)
Output:
[[ 1.5 2.5 3.5]
[ 6. 7. 8. ]
[ 9. 10. 11. ]]
To visualize the problem better, I will refer to the 2x2 dimensions of the array as the rows and columns, and the 3 dimension as depth. I will refer to vectors along the 3rd dimension as "pixels" (pixels have length 3), and planes along the first two dimensions as "channels".
Your loop is accumulating a set of pixels selected by the mask idx == i, and taking the median of each channel within that set. The result is an Nx3 array, where N is the number of distinct indices that you have.
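For reference, the whole loop can be written as one expression that stacks the per-label medians (a minimal sketch; it is still a Python-level loop over labels, so it clarifies rather than accelerates):

import numpy as np

dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])

# Each dat[idx == i] is an (N_i, 3) stack of the pixels with label i;
# taking the channel-wise median of each stack reproduces out.
out = np.array([np.median(dat[idx == i], axis=0) for i in np.unique(idx)])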
One day, generalized ufuncs will be ubiquitous in numpy, and np.median will be such a function. On that day, you will be able to use reduceat magic¹ to do something like:
unq, ind = np.unique(idx, return_inverse=True)
np.median.reduceat(dat.reshape(-1, dat.shape[-1]), np.r_[0, np.where(np.diff(unq[ind]))[0]+1])
¹ See *Applying operation to unevenly split portions of numpy array* for more info on the specific type of magic.
Since this is not currently possible, you can use scipy.ndimage.median instead. It lets you compute medians over a set of labeled areas in an array, which is exactly what you have with idx. The method assumes that your index array contains N densely packed values, all of which are in range(N); otherwise the reshaping operations below will not work properly.
If that is not the case, start by transforming idx:
_, ind = np.unique(idx, return_inverse=True)
idx = ind.reshape(idx.shape)
OR
idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
Since you are actually computing a separate median for each region and channel, you will need to have a set of labels for each channel. Flesh out idx to have a distinct set of indices for each channel:
chan = dat.shape[-1]
offset = idx.max() + 1
index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
Now index has an identical set of regions defined in each channel, which you can use in scipy.ndimage.median:
out = scipy.ndimage.median(dat, index, index=range(offset * chan)).reshape(chan, offset).T
The input labels must be densely packed from zero to offset * chan for index=range(offset * chan) to work properly, and the reshape operation to have the right number of elements. The final transpose is just an artifact of how the labels are arranged.
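To make that layout concrete, here is a quick sketch of what index looks like for the example above (offset = 3 and chan = 3 in that case, so each channel gets its own block of labels):

import numpy as np

idx = np.array([[0, 0], [1, 2]])
offset, chan = 3, 3
index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
print(index[..., 0])  # labels 0..2 mark the regions of channel 0
print(index[..., 1])  # labels 3..5 mark the same regions in channel 1
print(index[..., 2])  # labels 6..8 mark the same regions in channel 2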
Here is the complete product:
import numpy as np
from scipy.ndimage import median
dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])
def summarize(dat, idx):
    idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
    chan = dat.shape[-1]
    offset = idx.max() + 1
    index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
    return median(dat, index, index=range(offset * chan)).reshape(chan, offset).T
print(summarize(dat, idx))
Related
I am trying to calculate the average of a 3D array between two indices on the 1st axis. The start and end indices vary from cell to cell and are represented by two separate 2D arrays that are the same shape as a slice of the 3D array.
I have managed to implement a piece of code that loops through the pixels of my 3D array, but this method is painfully slow in the case of my array with a shape of (70, 550, 350). Is there a way to vectorise the operation using numpy or xarray (the arrays are stored in an xarray dataset)?
Here is a snippet of what I would like to optimise:
# My 3D raster containing values; shape = (time, x, y)
values = np.random.rand(10, 55, 60)
# A 2D raster containing start indices for the averaging
start_index = np.random.randint(0, 4, size=(values.shape[1], values.shape[2]))
# A 2D raster containing end indices for the averaging
end_index = np.random.randint(5, 9, size=(values.shape[1], values.shape[2]))
# Initialise an array that will contain results
mean_array = np.zeros_like(values[0, :, :])
# Loop over 3D raster to calculate the average between indices on axis 0
for i in range(0, values.shape[1]):
    for j in range(0, values.shape[2]):
        mean_array[i, j] = np.mean(values[start_index[i, j]: end_index[i, j], i, j], axis=0)
One way to do this without loops is to zero-out the entries you don't want to use, compute the sum of the remaining items, then divide by the number of nonzero entries. For example:
i = np.arange(values.shape[0])[:, None, None]
mean_array_2 = np.where((i >= start_index) & (i < end_index), values, 0).sum(0) / (end_index - start_index)
np.allclose(mean_array, mean_array_2)
# True
Note that this assumes that the indices are in the range 0 <= i < values.shape[0]; if this is not the case you can use np.clip or other means to standardize the indices before computation.
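For example, a minimal sketch of that standardization, assuming out-of-range indices should simply be clamped to the valid range rather than rejected:

import numpy as np

values = np.random.rand(10, 55, 60)
# Suppose the index rasters could fall outside [0, values.shape[0]):
start_index = np.random.randint(-3, 4, size=values.shape[1:])
end_index = np.random.randint(5, 14, size=values.shape[1:])

# Clamp both rasters so the where/sum approach above stays valid.
start_index = np.clip(start_index, 0, values.shape[0])
end_index = np.clip(end_index, 0, values.shape[0])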
Let's say I have 2 arrays of arrays: labels is 1D and data is 5D. Note that both arrays have the same first dimension.
To simplify things let's say labels contain only 3 arrays :
labels=np.array([[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]])
And let's say I have a datalist of data arrays (length = 3), where each array is 5D and its first dimension matches the length of the corresponding array in labels. In this example, datalist has 3 arrays of shapes (8,3,100,10,1), (5,3,100,10,1) and (10,3,100,10,1) respectively.
Now I want to reduce the number of zeros in each array of labels and keep the other values. Let's say I want to keep only 3 zeros for each array. Therefore, the length of each array in labels as well as the first dimension of each array in data will be 6, 4 and 8.
In order to reduce the number of zeros in each array of labels, I want to randomly select and keep only 3. The same randomly selected indexes will then be used to select the corresponding rows from data.
For this example, the new_labels array will be something like this :
new_labels=np.array([[0,0,1,1,2,0],[4,0,0,0],[0,3,2,1,0,1,7,0]])
Here's what I have tried so far :
all_ind = []          # to store indexes where value == 0 for all arrays
indexes_to_keep = []  # to store the randomly selected indexes
new_labels = []       # to store the final results

for i in range(len(labels)):
    ind = []  # indexes where value == 0 for one array
    for j in range(len(labels[i])):
        if labels[i][j] == 0:
            ind.append(j)
    all_ind.append(ind)

for k in range(len(labels)):
    indexes_to_keep.append(np.random.choice(all_ind[k], 3))
    aux = np.zeros(len(labels[k]) - len(all_ind[k]) + 3)
    ....
    ....
Here, how can I fill **aux** with the values?
    ....
    ....
    new_labels.append(aux)
Any suggestions?
Working with numpy arrays of different lengths is not a good idea, so you'll have to iterate over the items and apply some method to each one. Assuming you want to optimize that method only, masking might work pretty well here:
def specific_choice(x, n):
    '''leaving n random zeros of the list x'''
    x = np.array(x)
    mask = x != 0
    idx = np.flatnonzero(~mask)
    np.random.shuffle(idx)  # in-place shuffle of the zero positions, quite fast
    idx = idx[:n]
    mask[idx] = True
    return x[mask]  # or mask if you need it
Iterating over a list is faster than iterating over an array, so the effective usage would be:
labels = [[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]]
output = [specific_choice(n, 3) for n in labels]
Output:
[array([0, 1, 1, 2, 0, 0]), array([0, 4, 0, 0]), array([0, 3, 0, 2, 1, 1, 7, 0])]
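Since the question also asks for the corresponding rows of datalist, note that specific_choice already builds a boolean mask; a sketch of returning the mask and reusing it (assuming datalist holds the three arrays described in the question):

import numpy as np

labels = [[0,0,0,1,1,2,0,0], [0,4,0,0,0], [0,3,0,2,1,0,0,1,7,0]]
datalist = [np.random.rand(8,3,100,10,1),
            np.random.rand(5,3,100,10,1),
            np.random.rand(10,3,100,10,1)]

def specific_choice_mask(x, n):
    '''boolean mask keeping all nonzeros and n random zeros of the list x'''
    x = np.asarray(x)
    mask = x != 0
    idx = np.flatnonzero(~mask)
    np.random.shuffle(idx)
    mask[idx[:n]] = True
    return mask

masks = [specific_choice_mask(l, 3) for l in labels]
new_labels = [np.asarray(l)[m] for l, m in zip(labels, masks)]
new_data = [d[m] for d, m in zip(datalist, masks)]  # first dims become 6, 4, 8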
I'm certain there's a good way to do this but I'm blanking on the right search terms to google, so I'll ask here instead. My problem is this:
I have 2 two-dimensional arrays, both with the same dimensions. One array (array 1) is the accumulated precipitation at (x,y) points. The other (array 2) is the topographic height of the same (x,y) grid. I want to sum up array 1 between specific heights of array 2, and create a bar graph with topographic height bins as the x-axis and total accumulated precipitation on the y-axis.
So I want to be able to declare a list of heights (say [0, 100, 200, ..., 1000]) and for each bin, sum up all precipitation that occurred within that bin.
I can think of a few complicated ways to do this, but I'm guessing there's probably an easier way that I'm not thinking of. My gut instinct is to loop through my list of heights, mask anything outside of that range, sum up remaining values, add those to a new array, and repeat.
I'm wondering is if there's a built-in numpy or similar library that can do this more efficiently.
This code shows what you're asking for, with some explanation in the comments:
import numpy as np
def in_range(x, lower_bound, upper_bound):
    # returns whether x is between lower_bound (inclusive) and upper_bound (exclusive)
    return x in range(lower_bound, upper_bound)
# vectorize allows you to easily 'map' the function to a numpy array
vin_range = np.vectorize(in_range)
# representing your rainfall
rainfall = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# representing your height map
height = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3]])
# the bands of height you're looking to sum
bands = [[0, 2], [2, 4], [4, 6], [6, 8]]
# computing the actual results you'd want to chart
result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands]
print(result)
The next to last line is where the magic happens. vin_range(height, *band) uses the vectorized function to create a numpy array of boolean values, with the same dimensions as height, that has True if a value of height is in the range given, or False otherwise.
By using that array to index the array with the target values (rainfall), you get an array that only has the values for which the height is in the target range. Then it's just a matter of summing those.
In more steps than result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands] (but with the same result):
result = []
for lower, upper in bands:
    include = vin_range(height, lower, upper)
    values_to_include = rainfall[include]
    sum_of_rainfall = sum(values_to_include)
    result.append(([lower, upper], sum_of_rainfall))
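One caveat: np.vectorize is essentially a Python-level loop, so for large rasters plain boolean comparisons should be faster. A sketch of the same band logic without vectorize, using the rainfall, height, and bands defined above:

result = []
for lower, upper in bands:
    include = (height >= lower) & (height < upper)  # boolean mask, computed in C
    result.append(([lower, upper], rainfall[include].sum()))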
You can use np.bincount together with np.digitize. digitize creates an array of bin indices from the height array height and the bin boundaries bins. bincount then uses the bin indices to sum the data in array rain.
# set up
rain = np.random.randint(0,100,(5,5))/10
height = np.random.randint(0,10000,(5,5))/10
bins = [0,250,500,750,10000]
# compute
sums = np.bincount(np.digitize(height.ravel(),bins),rain.ravel(),len(bins)+1)
# result
sums
# array([ 0. , 37. , 35.6, 14.6, 22.4, 0. ])
# check against direct method
[rain[(height>=bins[i]) & (height<bins[i+1])].sum() for i in range(len(bins)-1)]
# [37.0, 35.6, 14.600000000000001, 22.4]
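If you only need the in-range sums (without the two out-of-range bins), np.histogram with weights is an equivalent one-liner; a sketch using the same rain, height, and bins as above (note that np.histogram treats the last bin as closed on the right):

sums_in_range, _ = np.histogram(height, bins=bins, weights=rain)
# should match the direct method above, e.g. [37. , 35.6, 14.6, 22.4]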
An example using the numpy ma module, which allows you to make masked arrays. From the docs:
A masked array is the combination of a standard numpy.ndarray and a mask. A mask is either nomask, indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not.
which seems what you need in this case.
import numpy as np
pr = np.random.randint(0, 1000, size=(100, 100)) #precipitation map
he = np.random.randint(0, 1000, size=(100, 100)) #height map
bins = np.arange(0, 1001, 200)
values = []
for vmin, vmax in zip(bins[:-1], bins[1:]):
    # create the masked array; bin minimum included, maximum excluded
    maskedpr = np.ma.masked_where((he < vmin) | (he >= vmax), pr)
    values.append(maskedpr.sum())
values is the list of values for each bin, which you can plot.
The numpy.ma.masked_where function returns an array masked where condition is True. So you need to set the condition to be True outside the bins.
The sum() method performs the sum only where the array is not masked.
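Each masked sum is equivalent to a plain boolean-indexed sum, which makes for a quick sanity check; a sketch reusing pr, he, and bins from above:

check = [pr[(he >= vmin) & (he < vmax)].sum()
         for vmin, vmax in zip(bins[:-1], bins[1:])]
print(values)  # sums from the masked arrays
print(check)   # should be identical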
I have a lot of 750x750 images. I want to take the geometric mean of non-overlapping 5x5 patches from each image, and then for each image, average those geometric means to create one feature per image. I wrote the code below, and it seems to work just fine. But, I know it's not very efficient. Running it on 300 or so images takes around 60 seconds. I have about 3000 images. So, while it works for my purpose, it's not efficient. How can I improve this code?
#each sublist of gmeans will contain a list of 22500 geometric means
#corresponding to the non-overlapping 5x5 patches for a given image.
gmeans = [[],[],[],[],[],[],[],[],[],[],[],[]]

#the loop here populates gmeans.
for folder in range(len(subfolders)):
    just_thefilename, colorsourceimages, graycroppedfiles = get_all_images(folder)
    for items in graycroppedfiles:
        myarray = misc.imread(items)
        area_of_big_matrix = 750*750
        area_of_small_matrix = 5*5
        how_many = area_of_big_matrix / area_of_small_matrix
        n = 0
        p = 0
        mylist = []
        while len(mylist) < how_many:
            mylist.append(gmean(myarray[n:n+5, p:p+5], None))
            n = n + 5
            if n == 750:
                p = p + 5
                n = 0
        gmeans[folder].append(mylist)
#each sublist of mean_of_gmeans will contain just one feature per image, the mean of the geometric means of the 5x5 patches.
mean_of_gmeans = [[],[],[],[],[],[],[],[],[],[],[],[]]
for folder in range(len(subfolders)):
    for items in range(len(gmeans[0])):
        mean_of_gmeans[folder].append(np.mean(gmeans[folder][items], dtype=np.float64))
I can understand the suggestion to move this to the code review site,
but this problem provides a nice example of the power of using vectorized
numpy and scipy functions, so I'll give an answer.
The function below, cleverly called func, computes the desired value.
The key is to reshape the image into a four-dimensional array. Then
it can be interpreted as a two-dimensional array of two-dimensional
arrays, where the inner arrays are the 5x5 blocks.
scipy.stats.gmean can compute the geometric mean over more than one
dimension, so that is used to reduce the four-dimensional array to the
desired two-dimensional array of geometric means. The return value is the
(arithmetic) mean of those geometric means.
import numpy as np
from scipy.stats import gmean
def func(img, blocksize=5):
    # img must be a 2-d array whose dimensions are divisible by blocksize.
    if (img.shape[0] % blocksize) != 0 or (img.shape[1] % blocksize) != 0:
        raise ValueError("blocksize does not divide the shape of img.")

    # Reshape 'img' into a 4-d array 'blocks', so blocks[i, :, j, :] is
    # the subarray with shape (blocksize, blocksize).
    blocks_nrows = img.shape[0] // blocksize
    blocks_ncols = img.shape[1] // blocksize
    blocks = img.reshape(blocks_nrows, blocksize, blocks_ncols, blocksize)

    # Compute the geometric mean over axes 1 and 3 of 'blocks'. This results
    # in the array of geometric means with shape (blocks_nrows, blocks_ncols).
    gmeans = gmean(blocks, axis=(1, 3), dtype=np.float64)

    # The return value is the average of 'gmeans'.
    avg = gmeans.mean()
    return avg
For example, here the function is applied to an array with shape (750, 750).
In [358]: np.random.seed(123)
In [359]: img = np.random.randint(1, 256, size=(750, 750)).astype(np.uint8)
In [360]: func(img)
Out[360]: 97.035648309350179
It isn't easy to verify that that is the correct result, so here is a much smaller example:
In [365]: np.random.seed(123)
In [366]: img = np.random.randint(1, 4, size=(3, 6))
In [367]: img
Out[367]:
array([[3, 2, 3, 3, 1, 3],
[3, 2, 3, 2, 3, 2],
[1, 2, 3, 2, 1, 3]])
In [368]: func(img, blocksize=3)
Out[368]: 2.1863131342986666
Here is the direct calculation:
In [369]: 0.5*(gmean(img[:,:3], axis=None) + gmean(img[:, 3:], axis=None))
Out[369]: 2.1863131342986666
I'm trying to resize a 2D numpy array of a given factor, obtaining a smaller array in output.
The array is read from an image file and some of the values should be NaN (Not a Number, np.nan from numpy): it is the result of remote sensing measurements from satellite and simply some pixels weren't measured.
The suitable package I found for this is scipy.misc.imresize, but each pixel in the output array containing a NaN is set to NaN, even if there are some valid data in the original pixels interpolated together.
My solution is appended here; what I've done is essentially:
- create a new array based on the original array shape and the desired reduction factor
- create an index array to address all the pixels of the original array to be averaged for each pixel in the new one
- cycle through the new array pixels and average all the non-NaN pixels to obtain the new array pixel value; if there are only NaNs, the output will be NaN

I'm planning to add a keyword to choose between different outputs (average, median, standard deviation of the input pixels, and so on).
It is working as expected, but on a ~1Mpx image it takes around 3 seconds. Due to my lack of experience in python I'm searching for improvements.
Does anyone have suggestions on how to do it better and more efficiently?
Does anyone know of a library that already implements all that stuff?
Thanks.
Here is an example output for random pixel input, generated with the code below:
import numpy as np
import pylab as plt
from scipy import misc
def resize_2d_nonan(array, factor):
    """
    Resize a 2D array by a different factor on the two axes, skipping NaN values.
    If a new pixel contains only NaN, it will be set to NaN.

    Parameters
    ----------
    array : 2D np array
    factor : int or tuple. If int, the x and y factors will be the same

    Returns
    -------
    array : 2D np array scaled by factor

    Created on Mon Jan 27 15:21:25 2014
    @author: damo_ma
    """
    xsize, ysize = array.shape

    if isinstance(factor, int):
        factor_x = factor
        factor_y = factor
    elif isinstance(factor, tuple):
        factor_x, factor_y = factor[0], factor[1]
    else:
        raise NameError('Factor must be a tuple (x,y) or an integer')

    if xsize % factor_x != 0 or ysize % factor_y != 0:
        raise NameError('Factors must be integer multiples of array shape')

    new_xsize, new_ysize = xsize // factor_x, ysize // factor_y

    new_array = np.empty([new_xsize, new_ysize])
    new_array[:] = np.nan # this saves us an assignment in the loop below

    # submatrix indexes : is the average box on the original matrix
    subrow, subcol = np.indices((factor_x, factor_y))

    # new matrix indexes
    row, col = np.indices((new_xsize, new_ysize))

    # some output for testing
    #for i, j, ind in zip(row.reshape(-1), col.reshape(-1), range(row.size)):
    #    print '----------------------------------------------'
    #    print 'i: %i, j: %i, ind: %i ' % (i, j, ind)
    #    print 'subrow+i*factor_x, subcol+j*factor_y :'
    #    print i, '*', factor_x, '=', i*factor_x
    #    print j, '*', factor_y, '=', j*factor_y
    #    print subrow+i*factor_x, subcol+j*factor_y
    #    print '---'
    #    print 'array[subrow+i*factor_x, subcol+j*factor_y] :'
    #    print array[subrow+i*factor_x, subcol+j*factor_y]

    for i, j, ind in zip(row.reshape(-1), col.reshape(-1), range(row.size)):
        # define the small sub_matrix as a view of the input matrix subset
        sub_matrix = array[subrow+i*factor_x, subcol+j*factor_y]
        # modified from any(a) and all(a) to a.any() and a.all()
        # see https://stackoverflow.com/a/10063039/1435167
        if not np.isnan(sub_matrix).all(): # if the submatrix is not all NaN
            if np.isnan(sub_matrix).any(): # if it contains some NaN
                msub_matrix = np.ma.masked_array(sub_matrix, np.isnan(sub_matrix))
                (new_array.reshape(-1))[ind] = np.mean(msub_matrix)
            else: # if it contains no NaN at all
                (new_array.reshape(-1))[ind] = np.mean(sub_matrix)
        # the all-NaN case needs no assignment thanks to the np.nan
        # initialization of new_array

    return new_array
row , cols = 6, 4
a = 10*np.random.random_sample((row , cols))
a[0:3,0:2] = np.nan
a[0,2] = np.nan
factor_x = 2
factor_y = 2
a_misc = misc.imresize(a, .5, interp='nearest', mode='F')
a_2d_nonan = resize_2d_nonan(a,(factor_x,factor_y))
print a
print
print a_misc
print
print a_2d_nonan
plt.subplot(131)
plt.imshow(a, interpolation='nearest')
plt.title('original')
plt.xticks(np.arange(a.shape[1]))
plt.yticks(np.arange(a.shape[0]))

plt.subplot(132)
plt.imshow(a_misc, interpolation='nearest')
plt.title('scipy.misc')
plt.xticks(np.arange(a_misc.shape[1]))
plt.yticks(np.arange(a_misc.shape[0]))

plt.subplot(133)
plt.imshow(a_2d_nonan, interpolation='nearest')
plt.title('my.func')
plt.xticks(np.arange(a_2d_nonan.shape[1]))
plt.yticks(np.arange(a_2d_nonan.shape[0]))
EDIT
I added some modifications to address ChrisProsser's comment.
If I substitute the NaNs with some other value, let's say the average of the non-NaN pixels, it will affect all the subsequent calculations: the difference between the resampled original array and the resampled array with the NaNs substituted shows that 2 pixels changed their values.
My goal is simply to skip all the NaN pixels.
# substitute NaN with the average value
ind_nonan, ind_nan = np.where(~np.isnan(a)), np.where(np.isnan(a))
a_substitute = np.copy(a)
a_substitute[ind_nan] = np.mean(a_substitute[ind_nonan]) # substitute the NaNs with the average of the non-NaNs

a_substitute_misc = misc.imresize(a_substitute, .5, interp='nearest', mode='F')
a_substitute_2d_nonan = resize_2d_nonan(a_substitute, (factor_x, factor_y))

print a_2d_nonan - a_substitute_2d_nonan
[[ nan -0.02296697]
[ 0.23143208 0. ]
[ 0. 0. ]]
**2nd EDIT**
To address Hooked's answer I added some code. It is an interesting idea; sadly, it interpolates new values over pixels that should be "empty" (NaN) and, for my small example, it generates more NaNs than good values.
X , Y = np.indices((row , cols))
X_new , Y_new = np.indices((row/factor_x , cols/factor_y))
from scipy.interpolate import CloughTocher2DInterpolator as intp
C = intp((X[ind_nonan],Y[ind_nonan]),a[ind_nonan])
a_interp = C(X_new , Y_new)
print a
print
print a_interp
array([[        nan,         nan],
       [        nan,         nan],
       [        nan,  6.32826577]])
You are operating on small windows of the array. Instead of looping through the array to make the windows, the array can be efficiently restructured by manipulating its strides. The numpy library provides the as_strided() function to help with that. An example is provided in the SciPy CookBook Stride tricks for the Game of Life.
The following will use a generalized sliding window function, which I will include at the end.
Determine the shape of the new array:
rows, cols = a.shape
new_shape = rows / 2, cols / 2
Restructure the array into the windows you need, and create an indexing array identifying NaNs:
# 2x2 windows of the original array
windows = sliding_window(a, (2,2))
# make a windowed boolean array for indexing
notNan = sliding_window(np.logical_not(np.isnan(a)), (2,2))
The new array can be made using a list comprehension or a generator expression.
# using a list comprehension
# make a list of the means of the windows, disregarding the Nan's
means = [window[index].mean() for window, index in zip(windows, notNan)]
new_array = np.array(means).reshape(new_shape)
# generator expression
# produces the means of the windows, disregarding the Nan's
means = (window[index].mean() for window, index in zip(windows, notNan))
new_array = np.fromiter(means, dtype = np.float32).reshape(new_shape)
The generator expression should conserve memory. Using itertools.izip() instead of `zip` should also help if memory is a problem. I just used the list comprehension for your solution.
Your function:
def resize_2d_nonan(array, factor):
    """
    Resize a 2D array by a different factor on the two axes, skipping NaN values.
    If a new pixel contains only NaN, it will be set to NaN.

    Parameters
    ----------
    array : 2D np array
    factor : int or tuple. If int, the x and y factors will be the same

    Returns
    -------
    array : 2D np array scaled by factor

    Created on Mon Jan 27 15:21:25 2014
    @author: damo_ma
    """
    xsize, ysize = array.shape

    if isinstance(factor, int):
        factor_x = factor
        factor_y = factor
        window_size = factor, factor
    elif isinstance(factor, tuple):
        factor_x, factor_y = factor
        window_size = factor
    else:
        raise NameError('Factor must be a tuple (x,y) or an integer')

    if (xsize % factor_x or ysize % factor_y):
        raise NameError('Factors must be integer multiples of array shape')

    new_shape = xsize // factor_x, ysize // factor_y

    # non-overlapping windows of the original array
    windows = sliding_window(array, window_size)

    # windowed boolean array for indexing
    notNan = sliding_window(np.logical_not(np.isnan(array)), window_size)

    # list of the means of the windows, disregarding the NaN's
    means = [window[index].mean() for window, index in zip(windows, notNan)]

    # new array
    new_array = np.array(means).reshape(new_shape)

    return new_array
I haven't done any time comparisons with your original function, but it should be faster.
Many solutions I've seen here on SO vectorize the operations to increase speed/efficiency - I don't quite have a handle on that and don't know if it can be applied to your problem. Searching SO for window, array, moving average, vectorize, and numpy should produce similar questions and answers for reference.
sliding_window() see attribution below:
import numpy as np
from numpy.lib.stride_tricks import as_strided as ast
from itertools import product
def norm_shape(shape):
    '''
    Normalize numpy array shapes so they're always expressed as a tuple,
    even for one-dimensional shapes.

    Parameters
        shape - an int, or a tuple of ints

    Returns
        a shape tuple
    '''
    try:
        i = int(shape)
        return (i,)
    except TypeError:
        # shape was not a number
        pass

    try:
        t = tuple(shape)
        return t
    except TypeError:
        # shape was not iterable
        pass

    raise TypeError('shape must be an int, or a tuple of ints')
def sliding_window(a, ws, ss=None, flatten=True):
    '''
    Return a sliding window over a in any number of dimensions

    Parameters:
        a  - an n-dimensional numpy array
        ws - an int (a is 1D) or tuple (a is 2D or greater) representing the size
             of each dimension of the window
        ss - an int (a is 1D) or tuple (a is 2D or greater) representing the
             amount to slide the window in each dimension. If not specified, it
             defaults to ws.
        flatten - if True, all slices are flattened, otherwise, there is an
             extra dimension for each dimension of the input.

    Returns
        an array containing each n-dimensional window from a
    '''
    if None is ss:
        # ss was not provided. the windows will not overlap in any direction.
        ss = ws
    ws = norm_shape(ws)
    ss = norm_shape(ss)

    # convert ws, ss, and a.shape to numpy arrays so that we can do math in every
    # dimension at once.
    ws = np.array(ws)
    ss = np.array(ss)
    shape = np.array(a.shape)

    # ensure that ws, ss, and a.shape all have the same number of dimensions
    ls = [len(shape), len(ws), len(ss)]
    if 1 != len(set(ls)):
        raise ValueError(
            'a.shape, ws and ss must all have the same length. They were %s' % str(ls))

    # ensure that ws is smaller than a in every dimension
    if np.any(ws > shape):
        raise ValueError(
            'ws cannot be larger than a in any dimension. '
            'a.shape was %s and ws was %s' % (str(a.shape), str(ws)))

    # how many slices will there be in each dimension?
    newshape = norm_shape(((shape - ws) // ss) + 1)
    # the shape of the strided array will be the number of slices in each dimension
    # plus the shape of the window (tuple addition)
    newshape += norm_shape(ws)
    # the strides tuple will be the array's strides multiplied by step size, plus
    # the array's strides (tuple addition)
    newstrides = norm_shape(np.array(a.strides) * ss) + a.strides
    strided = ast(a, shape=newshape, strides=newstrides)
    if not flatten:
        return strided

    # Collapse strided so that it has one more dimension than the window. I.e.,
    # the new array is a flat list of slices.
    meat = len(ws) if ws.shape else 0
    firstdim = (np.product(newshape[:-meat]),) if ws.shape else ()
    dim = firstdim + (newshape[-meat:])
    # remove any dimensions with size 1
    dim = filter(lambda i: i != 1, dim)
    return strided.reshape(dim)
sliding_window() attribution
I originally found this on a blog page that is now a broken link:
Efficient Overlapping Windows with Numpy - http://www.johnvinyard.com/blog/?p=268
With a little searching it looks like it now resides in the Zounds github repository. Thanks John Vinyard.
Note this post is pretty old and there are a lot of SO Q&A's regarding sliding windows, rolling windows, and (for images) patch extraction. There are a lot of one-offs using numpy's as_strided, but this function still seems to be the only one that handles n-d windowing. scikit-learn's sklearn.feature_extraction.image module seems to be often cited for extracting or viewing image patches.
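As an aside for readers on current numpy (1.20 or later, an assumption well beyond the scope of the original post), the non-overlapping case no longer needs hand-rolled strides; a minimal sketch:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(36, dtype=float).reshape(6, 6)
a[0, 0] = np.nan

# All 2x2 windows, then step by the window size to make them non-overlapping.
windows = sliding_window_view(a, (2, 2))[::2, ::2]

# NaN-skipping mean of each window; an all-NaN window yields NaN (with a RuntimeWarning).
new_array = np.nanmean(windows, axis=(2, 3))
print(new_array.shape)  # (3, 3)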
Interpolate the points, using scipy.interpolate, on a different grid. Below I've shown a cubic interpolator, which is slower but probably more accurate. You'll notice that the corner pixels are missing with this function, you could then use a linear or nearest neighbor interpolation to handle those last values.
import numpy as np
import pylab as plt
# Test data
row = np.linspace(-3,3,50)
X,Y = np.meshgrid(row,row)
Z = np.sqrt(X**2+Y**2) + np.cos(Y)
# Make some dead pixels, favor an edge
dead = np.random.random(Z.shape)
dead = (dead*X>.7)
Z[dead] =np.nan
from scipy.interpolate import CloughTocher2DInterpolator as intp
C = intp((X[~dead],Y[~dead]),Z[~dead])
new_row = np.linspace(-3,3,25)
xi,yi = np.meshgrid(new_row,new_row)
zi = C(xi,yi)
plt.subplot(121)
plt.title("Original signal 50x50")
plt.imshow(Z,interpolation='nearest')
plt.subplot(122)
plt.title("Interpolated signal 25x25")
plt.imshow(zi,interpolation='nearest')
plt.show()
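Following the note above about the missing corner pixels, a sketch that patches any remaining NaNs in zi with a nearest-neighbour interpolant built from the same valid points (reusing the arrays from the script above):

from scipy.interpolate import NearestNDInterpolator

# Fill the pixels outside the convex hull, which CloughTocher2DInterpolator leaves as NaN.
nearest = NearestNDInterpolator(np.column_stack((X[~dead], Y[~dead])), Z[~dead])
holes = np.isnan(zi)
zi[holes] = nearest(xi[holes], yi[holes])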