How can I make this code quicker? - python

I have a lot of 750x750 images. I want to take the geometric mean of non-overlapping 5x5 patches from each image, and then, for each image, average those geometric means to create one feature per image. I wrote the code below, and it seems to work just fine, but I know it's not very efficient: running it on 300 or so images takes around 60 seconds, and I have about 3000 images. How can I improve this code?
#each sublist of gmeans will contain a list of 22500 geometric means
#corresponding to the non-overlapping 5x5 patches for a given image.
gmeans = [[],[],[],[],[],[],[],[],[],[],[],[]]
#the loop here populates gmeans.
for folder in range(len(subfolders)):
    just_thefilename, colorsourceimages, graycroppedfiles = get_all_images(folder)
    for items in graycroppedfiles:
        myarray = misc.imread(items)
        area_of_big_matrix = 750*750
        area_of_small_matrix = 5*5
        how_many = area_of_big_matrix // area_of_small_matrix
        n = 0
        p = 0
        mylist = []
        while len(mylist) < how_many:
            mylist.append(gmean(myarray[n:n+5, p:p+5], None))
            n = n + 5
            if n == 750:
                p = p + 5
                n = 0
        gmeans[folder].append(mylist)
#each sublist of mean_of_gmeans will contain just one feature per image:
#the mean of the geometric means of the 5x5 patches.
mean_of_gmeans = [[],[],[],[],[],[],[],[],[],[],[],[]]
for folder in range(len(subfolders)):
    for items in range(len(gmeans[0])):
        mean_of_gmeans[folder].append(np.mean(gmeans[folder][items], dtype=np.float64))

I can understand the suggestion to move this to the code review site,
but this problem provides a nice example of the power of using vectorized
numpy and scipy functions, so I'll give an answer.
The function below, cleverly called func, computes the desired value.
The key is to reshape the image into a four-dimensional array. Then
it can be interpreted as a two-dimensional array of two-dimensional
arrays, where the inner arrays are the 5x5 blocks.
scipy.stats.gmean can compute the geometric mean over more than one
dimension, so that is used to reduce the four-dimensional array to the
desired two-dimensional array of geometric means. The return value is the
(arithmetic) mean of those geometric means.
import numpy as np
from scipy.stats import gmean

def func(img, blocksize=5):
    # img must be a 2-d array whose dimensions are divisible by blocksize.
    if (img.shape[0] % blocksize) != 0 or (img.shape[1] % blocksize) != 0:
        raise ValueError("blocksize does not divide the shape of img.")
    # Reshape 'img' into a 4-d array 'blocks', so blocks[i, :, j, :] is
    # the subarray with shape (blocksize, blocksize).
    blocks_nrows = img.shape[0] // blocksize
    blocks_ncols = img.shape[1] // blocksize
    blocks = img.reshape(blocks_nrows, blocksize, blocks_ncols, blocksize)
    # Compute the geometric mean over axes 1 and 3 of 'blocks'. This results
    # in the array of geometric means with shape (blocks_nrows, blocks_ncols).
    gmeans = gmean(blocks, axis=(1, 3), dtype=np.float64)
    # The return value is the average of 'gmeans'.
    avg = gmeans.mean()
    return avg
For example, here the function is applied to an array with shape (750, 750).
In [358]: np.random.seed(123)
In [359]: img = np.random.randint(1, 256, size=(750, 750)).astype(np.uint8)
In [360]: func(img)
Out[360]: 97.035648309350179
It isn't easy to verify that this is the correct result, so here is a much smaller example:
In [365]: np.random.seed(123)
In [366]: img = np.random.randint(1, 4, size=(3, 6))
In [367]: img
Out[367]:
array([[3, 2, 3, 3, 1, 3],
       [3, 2, 3, 2, 3, 2],
       [1, 2, 3, 2, 1, 3]])
In [368]: func(img, blocksize=3)
Out[368]: 2.1863131342986666
Here is the direct calculation:
In [369]: 0.5*(gmean(img[:,:3], axis=None) + gmean(img[:, 3:], axis=None))
Out[369]: 2.1863131342986666
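To go from func back to the original task (one feature per image, grouped by folder), the function can simply replace the inner patch loop. A minimal sketch, reusing the asker's get_all_images helper and assuming imageio.v2.imread as a stand-in for the deprecated scipy.misc.imread:
import imageio.v2 as imageio  # assumed replacement for misc.imread

mean_of_gmeans = [[] for _ in range(len(subfolders))]
for folder in range(len(subfolders)):
    just_thefilename, colorsourceimages, graycroppedfiles = get_all_images(folder)
    for items in graycroppedfiles:
        # one feature per image: the mean of the per-patch geometric means
        mean_of_gmeans[folder].append(func(imageio.imread(items)))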

Related

Average of a 3D numpy slice based on 2D arrays

I am trying to calculate the average of a 3D array between two indices on the 1st axis. The start and end indices vary from cell to cell and are represented by two separate 2D arrays that are the same shape as a slice of the 3D array.
I have managed to implement a piece of code that loops through the pixels of my 3D array, but this method is painfully slow in the case of my array with a shape of (70, 550, 350). Is there a way to vectorise the operation using numpy or xarray (the arrays are stored in an xarray dataset)?
Here is a snippet of what I would like to optimise:
import numpy as np

# My 3D raster containing values; shape = (time, x, y)
values = np.random.rand(10, 55, 60)
# A 2D raster containing start indices for the averaging
start_index = np.random.randint(0, 4, size=(values.shape[1], values.shape[2]))
# A 2D raster containing end indices for the averaging
end_index = np.random.randint(5, 9, size=(values.shape[1], values.shape[2]))
# Initialise an array that will contain results
mean_array = np.zeros_like(values[0, :, :])
# Loop over 3D raster to calculate the average between indices on axis 0
for i in range(0, values.shape[1]):
    for j in range(0, values.shape[2]):
        mean_array[i, j] = np.mean(values[start_index[i, j]:end_index[i, j], i, j], axis=0)
One way to do this without loops is to zero-out the entries you don't want to use, compute the sum of the remaining items, then divide by the number of nonzero entries. For example:
i = np.arange(values.shape[0])[:, None, None]
mean_array_2 = np.where((i >= start_index) & (i < end_index), values, 0).sum(0) / (end_index - start_index)
np.allclose(mean_array, mean_array_2)
# True
Note that this assumes that the indices are in the range 0 <= i < values.shape[0]; if this is not the case you can use np.clip or other means to standardize the indices before computation.
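For example, a hedged sketch of that standardization step, assuming out-of-range indices should simply be clamped to the valid range:
# clamp the index rasters into [0, values.shape[0]] before the vectorized computation
start_index = np.clip(start_index, 0, values.shape[0])
end_index = np.clip(end_index, 0, values.shape[0])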

Is there a method in numpy that computes the mean on every 2d array from a 3d array?

For example:
import numpy as np
a = np.array([[[1,2,3],[1,2,3]],[[4,5,6],[7,8,7]]])
print(a.shape)
# (2, 2, 3)
So, for every 2d grid (3 grids in the example above) I want to compute the mean:
mean = [np.mean(a[:, :, i]) for i in range(3)]
print(mean)
# [3.25, 4.25, 4.75]
Is there a method in numpy that achieves this?
I tried using mean on the axis but the result is not as expected.
You can accomplish this using np.mean(axis=...) and specifying a tuple of dimensions to average over:
a.mean(axis=tuple(range(len(a.shape) - 1)))
This will compute the mean on every dimension/axis except the last one (note how the range of axis indices goes from 0 up to len(a.shape) - 1, exclusive, thus ignoring the last axis).
This method is extensible to deeper arrays. For example, if you have an array of shape (2, 6, 5, 4, 3), it will compute the mean as a.mean(axis=(0, 1, 2, 3)), thus giving you an array of 3 means (corresponding to the 3 elements in the last dimension).
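A minimal sketch of that deeper case, with random data just to show the shapes:
a = np.random.rand(2, 6, 5, 4, 3)
# average over every axis except the last
means = a.mean(axis=tuple(range(len(a.shape) - 1)))
print(means.shape)
# (3,)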

Sum up data on specific (multiple) ranges

I'm certain there's a good way to do this but I'm blanking on the right search terms to google, so I'll ask here instead. My problem is this:
I have two 2-dimensional arrays, both with the same dimensions. One array (array 1) is the accumulated precipitation at (x, y) points. The other (array 2) is the topographic height of the same (x, y) grid. I want to sum up array 1 between specific heights of array 2, and create a bar graph with topographic height bins as the x-axis and total accumulated precipitation on the y-axis.
So I want to be able to declare a list of heights (say [0, 100, 200, ..., 1000]) and for each bin, sum up all precipitation that occurred within that bin.
I can think of a few complicated ways to do this, but I'm guessing there's probably an easier way that I'm not thinking of. My gut instinct is to loop through my list of heights, mask anything outside of that range, sum up remaining values, add those to a new array, and repeat.
I'm wondering if there's a built-in numpy function or similar library that can do this more efficiently.
This code shows what you're asking for, with some explanation in the comments:
import numpy as np

def in_range(x, lower_bound, upper_bound):
    # returns whether x is between lower_bound (inclusive) and upper_bound (exclusive)
    return x in range(lower_bound, upper_bound)

# vectorize allows you to easily 'map' the function to a numpy array
vin_range = np.vectorize(in_range)
# representing your rainfall
rainfall = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# representing your height map
height = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3]])
# the bands of height you're looking to sum
bands = [[0, 2], [2, 4], [4, 6], [6, 8]]
# computing the actual results you'd want to chart
result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands]
print(result)
The next to last line is where the magic happens. vin_range(height, *band) uses the vectorized function to create a numpy array of boolean values, with the same dimensions as height, that has True if a value of height is in the range given, or False otherwise.
By using that array to index the array with the target values (rainfall), you get an array that only has the values for which the height is in the target range. Then it's just a matter of summing those.
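With the example arrays above, the printed result pairs each band with its total: 4 for [0, 2), 28 for [2, 4), 5 for [4, 6), and 8 for [6, 8).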
In more steps than result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands] (but with the same result):
result = []
for lower, upper in bands:
    include = vin_range(height, lower, upper)
    values_to_include = rainfall[include]
    sum_of_rainfall = sum(values_to_include)
    result.append(([lower, upper], sum_of_rainfall))
You can use np.bincount together with np.digitize. digitize creates an array of bin indices from the height array height and the bin boundaries bins. bincount then uses the bin indices to sum the data in array rain.
# set up
rain = np.random.randint(0, 100, (5, 5)) / 10
height = np.random.randint(0, 10000, (5, 5)) / 10
bins = [0, 250, 500, 750, 10000]
# compute
sums = np.bincount(np.digitize(height.ravel(), bins), rain.ravel(), len(bins) + 1)
# result
sums
# array([ 0. , 37. , 35.6, 14.6, 22.4,  0. ])
# check against direct method
[rain[(height >= bins[i]) & (height < bins[i+1])].sum() for i in range(len(bins)-1)]
# [37.0, 35.6, 14.600000000000001, 22.4]
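As an aside, np.histogram computes the same binned sums when the precipitation is passed as weights (with the usual caveat that histogram's last bin also includes its right edge); a quick sketch using the arrays above:
# the in-range bins of 'sums' computed in one call
hist_sums, _ = np.histogram(height.ravel(), bins=bins, weights=rain.ravel())
# hist_sums should match sums[1:-1]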
An example using the numpy ma module, which allows you to make masked arrays. From the docs:
A masked array is the combination of a standard numpy.ndarray and a mask. A mask is either nomask, indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not.
which seems what you need in this case.
import numpy as np

pr = np.random.randint(0, 1000, size=(100, 100))  # precipitation map
he = np.random.randint(0, 1000, size=(100, 100))  # height map
bins = np.arange(0, 1001, 200)
values = []
for vmin, vmax in zip(bins[:-1], bins[1:]):
    # creating the masked array; the bin minimum is included, the maximum excluded.
    maskedpr = np.ma.masked_where((he < vmin) | (he >= vmax), pr)
    values.append(maskedpr.sum())
values is the list of values for each bin, which you can plot.
The numpy.ma.masked_where function returns an array masked where condition is True. So you need to set the condition to be True outside the bins.
The sum() method performs the sum only where the array is not masked.
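To turn values into the bar graph the question asks for, a minimal sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt

# one bar per height bin, positioned at the bin's lower edge
plt.bar(bins[:-1], values, width=np.diff(bins), align='edge', edgecolor='black')
plt.xlabel('topographic height')
plt.ylabel('total accumulated precipitation')
plt.show()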

Summarize ndarray by 2d array in Python

I want to summarize a 3d array dat using indices contained in a 2d array idx.
Consider the example below. For each margin along dat[:, :, i], I want to compute the median according to some index idx. The desired output (out) is a 2d array, whose rows record the index and columns record the margin. The following code works but is not very efficient. Any suggestions?
import numpy as np

dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])
out = np.empty((3, 3))
for i in np.unique(idx):
    out[i,] = np.median(dat[idx == i], axis=0)
print(out)
Output:
[[ 1.5  2.5  3.5]
 [ 6.   7.   8. ]
 [ 9.  10.  11. ]]
To visualize the problem better, I will refer to the 2x2 dimensions of the array as the rows and columns, and the dimension of size 3 as depth. I will refer to vectors along the 3rd dimension as "pixels" (pixels have length 3), and planes along the first two dimensions as "channels".
Your loop is accumulating a set of pixels selected by the mask idx == i, and taking the median of each channel within that set. The result is an Nx3 array, where N is the number of distinct indices that you have.
One day, generalized ufuncs will be ubiquitous in numpy, and np.median will be such a function. On that day, you will be able to use reduceat magic [1] to do something like
unq, ind = np.unique(idx, return_inverse=True)
np.median.reduceat(dat.reshape(-1, dat.shape[-1]), np.r_[0, np.where(np.diff(unq[ind]))[0]+1])
[1] See Applying operation to unevenly split portions of numpy array for more info on the specific type of magic.
Since this is not currently possible, you can use scipy.ndimage.median instead. This version allows you to compute medians over a set of labeled areas in an array, which is exactly what you have with idx. This method assumes that your index array contains N densely packed values, all of which are in range(N). Otherwise the reshaping operations will not work properly.
If that is not the case, start by transforming idx:
_, ind = np.unique(idx, return_inverse=True)
idx = ind.reshape(idx.shape)
OR
idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
Since you are actually computing a separate median for each region and channel, you will need to have a set of labels for each channel. Flesh out idx to have a distinct set of indices for each channel:
chan = dat.shape[-1]
offset = idx.max() + 1
index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
Now index has an identical set of regions defined in each channel, which you can use in scipy.ndimage.median:
out = scipy.ndimage.median(dat, index, index=range(offset * chan)).reshape(chan, offset).T
The input labels must be densely packed from zero to offset * chan for index=range(offset * chan) to work properly, and the reshape operation to have the right number of elements. The final transpose is just an artifact of how the labels are arranged.
Here is the complete product:
import numpy as np
from scipy.ndimage import median

dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])

def summarize(dat, idx):
    idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
    chan = dat.shape[-1]
    offset = idx.max() + 1
    index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
    return median(dat, index, index=range(offset * chan)).reshape(chan, offset).T

print(summarize(dat, idx))
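As a quick sanity check, this can be compared against the question's original loop:
# recompute the expected result with the loop from the question
out = np.empty((3, 3))
for i in np.unique(idx):
    out[i,] = np.median(dat[idx == i], axis=0)
print(np.allclose(summarize(dat, idx), out))
# True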

Vectorize eigenvalue calculation in Numpy

I would like a numpy-sh way of vectorizing the calculation of eigenvalues, such that I can feed it a matrix of matrices and it would return a matrix of the respective eigenvalues.
For example, in the code below, B is the block 6x6 matrix composed of 4 copies of the 3x3 matrix A.
C is what I would like to see as output, i.e. an array of dimension (2,2,3) (because A has 3 eigenvalues).
This is of course a very simplified example, in the general case the matrices A can have any size (although they are still square), and the matrix B is not necessarily formed of copies of A, but different A1, A2, etc (all of same size but containing different elements).
import numpy as np
A = np.array([[0, 1, 0],
              [0, 2, 0],
              [0, 0, 3]])
B = np.bmat([[A, A], [A, A]])
C = np.array([[np.linalg.eigvals(B[0:3, 0:3]), np.linalg.eigvals(B[0:3, 3:6])],
              [np.linalg.eigvals(B[3:6, 0:3]), np.linalg.eigvals(B[3:6, 3:6])]])
Edit: if you're using a version of numpy >= 1.8.0, then np.linalg.eigvals operates over the last two dimensions of whatever array you hand it, so if you reshape your input to an (n_subarrays, nrows, ncols) array you'll only have to call eigvals once:
import numpy as np
A = np.array([[0, 1, 0],
              [0, 2, 0],
              [0, 0, 3]])
# the input needs to be an array, since matrices can only be 2D.
B = np.repeat(A[np.newaxis, ...], 4, 0)
# for arbitrary input arrays you could do something like:
#   B = np.vstack(a[np.newaxis, ...] for a in input_arrays)
# but for this to work it will be necessary for each element in
# 'input_arrays' to have the same shape
# eigvals will operate over the last two dimensions of the array and return
# a (4, 3) array of eigenvalues
C = np.linalg.eigvals(B)
# reshape this output so that it matches your original example
C.shape = (2, 2, 3)
If your input arrays don't all have the same dimensions, e.g. input_arrays[0].shape == (2, 2), input_arrays[1].shape == (3, 3) etc. then you could only vectorize this calculation across subsets with matching dimensions.
If you're using an older version of numpy then unfortunately I don't think there's any way to vectorize the calculation of the eigenvalues over multiple input arrays - you'll just have to loop over your inputs in Python instead.
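For the mixed-shape case, a hedged sketch (assuming the matrices live in a list called input_arrays, as in the comments above) could group them by shape and vectorize within each group:
from collections import defaultdict
import numpy as np

# bucket the input matrices by shape, then call eigvals once per bucket
groups = defaultdict(list)
for a in input_arrays:
    groups[a.shape].append(a)
eigvals_by_shape = {shape: np.linalg.eigvals(np.array(arrs))
                    for shape, arrs in groups.items()}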
You could just do something like this:
C = np.array([[np.linalg.eigvals(B[i:i+3, j:j+3])
               for j in range(0, B.shape[1], 3)]
              for i in range(0, B.shape[0], 3)])
Perhaps a nicer approach is to use the block_view function from https://stackoverflow.com/a/5078155/1352250:
B_blocks = block_view(B)
C = np.array([[np.linalg.eigvals(m) for m in v] for v in B_blocks])
Update
As ali_m points out, this method is a form of syntactic sugar that will not reduce the overhead incurred from calling eigvals a large number of times. While this overhead should be small if each matrix it is applied to is large-ish, for the 6x6 matrices that the OP is interested in, it is not trivial (see the comments below; according to ali_m, there might be a factor of three difference between the version I give above, and the version he posted that uses Numpy >= 1.8.0).
