Fastest way to create an array in Python [duplicate] - python

This question already has answers here:
NumPy array initialization (fill with identical values)
(9 answers)
Closed 8 years ago.
I want to create a 3D array in Python, filled with -1.
I tested these methods:
import numpy as np
l = 200
b = 100
h = 30
%timeit grid = [[[-1 for x in range(l)] for y in range(b)] for z in range(h)]
1 loops, best of 3: 458 ms per loop
%timeit grid = -1 * np.ones((l, b, h), dtype=np.int)
10 loops, best of 3: 35.5 ms per loop
%timeit grid = np.zeros((l, b, h), dtype=np.int) - 1
10 loops, best of 3: 31.7 ms per loop
%timeit grid = -1.0 * np.ones((l, b, h), dtype=np.float32)
10 loops, best of 3: 42.1 ms per loop
%%timeit
grid = np.empty((l,b,h))
grid.fill(-1.0)
100 loops, best of 3: 13.7 ms per loop
So obviously, the last one is the fastest. Does anybody have an even faster method, or at least a less memory-intensive one? It has to run on a Raspberry Pi.

The only thing I can think of to add is that any of these methods will be faster with the dtype argument chosen to take up as little memory as possible.
Assuming you need no more space than int8, the method suggested by @RutgerKassies in the comments took this long on my system:
%timeit grid = np.full((l, b, h), -1, dtype=np.int8)
1000 loops, best of 3: 286 µs per loop
For comparison, not specifying dtype (defaulting to int32) took about 10 times longer with the same method:
%timeit grid = np.full((l, b, h), -1)
100 loops, best of 3: 3.61 ms per loop
Your fastest method was about as fast as np.full (sometimes beating it):
%%timeit
grid = np.empty((l,b,h))
grid.fill(-1)
100 loops, best of 3: 3.51 ms per loop
or, with dtype specified as int8,
1000 loops, best of 3: 255 µs per loop
Edit: This is probably cheating, but, well...
%timeit grid = np.lib.stride_tricks.as_strided(np.array(-1, dtype=np.int8), (l, b, h), (0, 0, 0))
100000 loops, best of 3: 12.4 us per loop
All that's happening here is that we begin with a single -1 stored in a zero-dimensional array, np.array(-1, dtype=np.int8), and then fiddle with the stride lengths so that grid looks exactly like an array with the required dimensions, without actually allocating it.
If you need an actual array, you can use grid = grid.copy(); this makes the creation of the grid array about as fast as the quickest approaches suggested elsewhere on this page.
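For illustration (a small sketch, not from the original answer), the view and a materialised copy behave like this:
import numpy as np
l, b, h = 200, 100, 30
view = np.lib.stride_tricks.as_strided(np.array(-1, dtype=np.int8), (l, b, h), (0, 0, 0))
print(view.shape, view[3, 5, 7])  # (200, 100, 30) -1 -- every element aliases the same byte
grid = view.copy()                # a real, independently writable array
grid[0, 0, 0] = 5                 # safe; writing through the view itself would not be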

This is a little faster for me. It might be different on a Raspberry Pi though.
grid = np.empty((l,b,h))
grid[...] = -1
np.int8 is much faster if it's big enough for you
grid = np.empty((l,b,h), dtype=np.int8)
grid[...] = -1
%%timeit
grid = np.empty((l,b,h), dtype=np.int8)
grid[:] = -1
100 loops, best of 3: 6.91 ms per loop

Related

Is there a better way of implementing a histogram?

I have a 2D uint16 numpy array and I want to calculate the histogram for this array.
The function I use is:
def calc_hist(source):
    hist = np.zeros(2**16, dtype='uint16')
    for i in range(source.shape[0]):
        for j in range(source.shape[1]):
            hist[source[i, j]] = hist[source[i, j]] + 1
    return hist
This function takes too much time to execute.
As I understand it, there's a histogram function in the numpy module, but I can't figure out how to use it.
I've tried:
hist,_ = np.histogram(source.flatten(), bins=range(2**16))
But I get different results than with my own function.
How can I call numpy.histogram to achieve the same result? Or are there any other options?
For an input with data type uint16, numpy.bincount should work well:
hist = np.bincount(source.ravel(), minlength=2**16)
Your function is doing almost exactly what bincount does, but bincount is implemented in C.
For example, the following checks that this use of bincount gives the same result as your calc_hist function:
In [159]: rng = np.random.default_rng()
In [160]: x = rng.integers(0, 2**16, size=(1000, 1000))
In [161]: h1 = calc_hist(x)
In [162]: h2 = np.bincount(x.ravel(), minlength=2**16)
In [163]: (h1 == h2).all() # Verify that they are the same.
Out[163]: True
Check the performance with IPython's %timeit command. You can see that using bincount is much faster.
In [164]: %timeit calc_hist(x)
2.66 s ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [165]: %timeit np.bincount(x.ravel(), minlength=2**16)
3.13 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As Corley Brigman pointed out, passing bins=range(x) determines the bin edges [1]. So, you will end up with x-1 bins with corresponding edges [0, 1), [1, 2), ..., [x-2, x-1].
In your case, you will have 2^16-1 bins. To fix it, simply use range(2**16+1).
[1] https://numpy.org/doc/stable/reference/generated/numpy.histogram.html?highlight=histogram#numpy.histogram
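As a quick, self-contained sketch (not part of the original answer), the corrected call then agrees with bincount:
import numpy as np
rng = np.random.default_rng()
source = rng.integers(0, 2**16, size=(100, 100), dtype=np.uint16)
hist_np, _ = np.histogram(source.ravel(), bins=range(2**16 + 1))  # 2**16 bins, edges 0..65536
hist_bc = np.bincount(source.ravel(), minlength=2**16)
assert (hist_np == hist_bc).all()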

Numpy: Replace every value in the array with the mean of its adjacent elements

I have an ndarray, and I want to replace every value in the array with the mean of its adjacent elements. The code below can do the job, but it is super slow when I have 700 arrays, all with shape (7000, 7000), so I wonder if there are better ways to do it. Thanks!
a = np.array(([1,2,3,4,5,6,7,8,9],[4,5,6,7,8,9,10,11,12],[3,4,5,6,7,8,9,10,11]))
row, col = a.shape
new_arr = np.ndarray(a.shape)
for x in xrange(row):
    for y in xrange(col):
        min_x = max(0, x-1)
        min_y = max(0, y-1)
        new_arr[x][y] = a[min_x:(x+2), min_y:(y+2)].mean()
print new_arr
Well, that's a smoothing operation in image processing, which can be achieved with 2D convolution. You are treating the near-boundary elements a bit differently, so if the boundary elements are allowed to be a bit off, you can use scipy's convolve2d like so -
from scipy.signal import convolve2d as conv2
out = conv2(a, np.ones((3,3)), 'same') / 9.0
This specific operation is built into the OpenCV module as cv2.blur and is very efficient at it. The name basically describes its operation of blurring the input arrays representing images. I believe the efficiency comes from the fact that internally it's implemented entirely in C for performance, with a thin Python wrapper to handle NumPy arrays.
So, the output could be alternatively calculated with it, like so -
import cv2 # Import OpenCV module
out = cv2.blur(a.astype(float),(3,3))
Here's a quick show-down on timings on a decently big image/array -
In [93]: a = np.random.randint(0,255,(5000,5000)) # Input array
In [94]: %timeit conv2(a,np.ones((3,3)),'same')/9.0
1 loops, best of 3: 2.74 s per loop
In [95]: %timeit cv2.blur(a.astype(float),(3,3))
1 loops, best of 3: 627 ms per loop
Following the discussion with @Divakar, find below a comparison of the different convolution methods present in scipy:
import numpy as np
from scipy import signal, ndimage
def conv2(A, size):
    return signal.convolve2d(A, np.ones((size, size)), mode='same') / float(size**2)

def fftconv(A, size):
    return signal.fftconvolve(A, np.ones((size, size)), mode='same') / float(size**2)

def uniform(A, size):
    return ndimage.uniform_filter(A, size, mode='constant')
All 3 methods return exactly the same value. However, note that uniform_filter takes a parameter mode='constant', which indicates the boundary conditions of the filter; constant == 0 (zero padding) is the same boundary condition that the other 2 methods enforce. For different use cases you can change the boundary conditions.
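As a quick sanity check (a sketch, not in the original answer), the agreement can be verified with the three functions defined above:
B = np.random.randn(64, 64)  # small test matrix just for the check
assert np.allclose(conv2(B, 3), fftconv(B, 3))
assert np.allclose(conv2(B, 3), uniform(B, 3))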
Now some test matrices:
A = np.random.randn(1000, 1000)
And some timings:
%timeit conv2(A, 3) # 33.8 ms per loop
%timeit fftconv(A, 3) # 84.1 ms per loop
%timeit uniform(A, 3) # 17.1 ms per loop
%timeit conv2(A, 5) # 68.7 ms per loop
%timeit fftconv(A, 5) # 92.8 ms per loop
%timeit uniform(A, 5) # 17.1 ms per loop
%timeit conv2(A, 10) # 210 ms per loop
%timeit fftconv(A, 10) # 86 ms per loop
%timeit uniform(A, 10) # 16.4 ms per loop
%timeit conv2(A, 30) # 1.75 s per loop
%timeit fftconv(A, 30) # 102 ms per loop
%timeit uniform(A, 30) # 16.5 ms per loop
So in short, uniform_filter seems faster, and that is because the convolution is separable into two 1D convolutions (similar to gaussian_filter, which is also separable).
Other, non-separable filters with different kernels are more likely to be faster using the signal module (the one in @Divakar's solution).
The speed of both fftconvolve and uniform_filter remains roughly constant for different kernel sizes, while convolve2d gets considerably slower as the kernel grows.
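To illustrate the separability point (a sketch, not part of the original benchmarks), a size-k uniform filter can be applied as two 1D passes, one per axis, and gives the same result as the 2D call:
import numpy as np
from scipy import ndimage
X = np.random.randn(200, 200)
k = 5
two_d = ndimage.uniform_filter(X, k, mode='constant')
one_d = ndimage.uniform_filter1d(
    ndimage.uniform_filter1d(X, k, axis=0, mode='constant'),
    k, axis=1, mode='constant')
assert np.allclose(two_d, one_d)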
I had a similar problem recently and had to find a different solution since I can't use scipy.
import numpy as np
a = np.random.randint(100, size=(7000,7000)) #Array of 7000 x 7000
row,col = a.shape
column_totals = a.sum(axis=0) #Dump the sum of all columns into a single array
new_array = np.zeros([row,col]) #Create a receiving array
for i in range(row):
    #Resulting row = the sum over all rows minus the original row, divided by the number of rows minus one.
    new_array[i] = (column_totals - a[i]) / (row - 1)
print(new_array)

Broadcastable Numpy dot

I have an array H of dimensions (n0, n2) and an array W of dimensions (n0, n1, n2, n3) and I want to do the following operation:
(H[:, None, :, None] * W).sum(axis=(0, 2))
As far as I know, the above line does not use BLAS libraries. Is there a way to use numpy.dot or a similar function that uses BLAS to do the same computation (and still without copying the array H several times in memory)?
You have identified one way of doing this; I know of two others.
For a small example
In [365]: n0,n1,n2,n3=2,3,4,5
In [366]: H=np.ones((n0,n2));W=np.ones((n0,n1,n2,n3))
comparative times are:
In [362]: timeit np.tensordot(H,W,[(0,1),(0,2)])
10000 loops, best of 3: 32.8 µs per loop
In [363]: timeit np.einsum('ik,ijkl',H,W)
100000 loops, best of 3: 10.7 µs per loop
In [364]: timeit (H[:,None,:,None]*W).sum(axis=(0,2))
10000 loops, best of 3: 29.5 µs per loop
tensordot reshapes and transposes the inputs so it can call np.dot; einsum decodes the subscript string and runs its own nditer loop in C.
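As an illustration (a sketch, not from the answer), the reshape-and-transpose that tensordot performs internally can also be written out by hand, reducing the whole computation to a single BLAS-backed np.dot:
import numpy as np
n0, n1, n2, n3 = 2, 3, 4, 5
H = np.random.rand(n0, n2)
W = np.random.rand(n0, n1, n2, n3)
# Move the summed axes of W (0 and 2) to the end, flatten, and multiply:
Wm = W.transpose(1, 3, 0, 2).reshape(n1 * n3, n0 * n2)
out = Wm.dot(H.ravel()).reshape(n1, n3)
assert np.allclose(out, np.tensordot(H, W, [(0, 1), (0, 2)]))
assert np.allclose(out, (H[:, None, :, None] * W).sum(axis=(0, 2)))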
https://stackoverflow.com/a/31129207/901925 has timings for another multidimensional dot, involving (100,)*(10,100,100)*(100,) arrays.

Efficient conversion of a 3D numpy array to a 1D numpy array

I have a 3D numpy array in this form:
>>>img.shape
(4504932, 2, 2)
>>> img
array([[[15114, 15306],
        [15305, 15304]],
       [[15305, 15306],
        [15303, 15304]],
       [[15305, 15306],
        [15303, 15304]],
       ...,
       [[15305, 15302],
        [15305, 15302]]], dtype=uint16)
Which I want to convert to a 1D numpy array where each entry is the sum of each 2x2 submatrix in the above img numpy array.
I have been able to accomplish this using:
img_new = np.array([i.sum() for i in img])
>>> img_new
array([61029, 61218, 61218, ..., 61214, 61214, 61214], dtype=uint64)
Which is exactly what I want, but it is too slow (takes about 10 seconds). Is there a faster method I could use? I included img.shape above so you have an idea of the size of this numpy array.
EDIT - ADDITIONAL INFO:
My img array could also be a 3D array made of 4x4, 5x5, 7x7, etc. submatrices. This is specified by the variables sub_rows and sub_cols.
img.sum(axis=(1, 2))
sum allows you to specify an axis or axes along which to sum, rather than just summing the whole array. This allows NumPy to loop over the array in C and perform just a few machine instructions per sum, rather than having to go through the Python bytecode evaluation loop and create a ton of wrapper objects to stick in a list.
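For what it's worth (a small sketch, not from the answer), the same call works unchanged for the larger sub_rows x sub_cols blocks mentioned in the edit, since axes 1 and 2 are summed whatever their length:
import numpy as np
sub_rows, sub_cols = 5, 5  # hypothetical block size
img = np.random.randint(0, 2**16, size=(1000, sub_rows, sub_cols), dtype=np.uint16)
img_new = img.sum(axis=(1, 2))  # one sum per sub-matrix
assert img_new.shape == (1000,)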
Using a numpy method (apply_over_axes) is usually quicker, and indeed that is the case here. I just tested on a 4000x2x2 array:
img = np.random.rand(4000,2,2)
%timeit np.apply_over_axes(np.sum, img, [1, 2])
# 1000 loops, best of 3: 721 µs per loop
%timeit np.array([i.sum() for i in img])
# 100 loops, best of 3: 17.2 ms per loop
You can use np.einsum -
img_new = np.einsum('ijk->i',img)
Verify results
In [42]: np.array_equal(np.array([i.sum() for i in img]),np.einsum('ijk->i',img))
Out[42]: True
Runtime tests
In [34]: img = np.random.randint(0,10000,(10000,2,2)).astype('uint16')
In [35]: %timeit np.array([i.sum() for i in img]) # Original approach
10 loops, best of 3: 92.4 ms per loop
In [36]: %timeit img.sum(axis=(1, 2)) # From other solution
1000 loops, best of 3: 297 µs per loop
In [37]: %timeit np.einsum('ijk->i',img)
10000 loops, best of 3: 102 µs per loop

Using numpy.take for faster fancy indexing

EDIT I have kept the more complicated problem I am facing below, but my problems with np.take can be summarized better as follows. Say you have an array img of shape (planes, rows), and another array lut of shape (planes, 256), and you want to use them to create a new array out of shape (planes, rows), where out[p,j] = lut[p, img[p, j]]. This can be achieved with fancy indexing as follows:
In [4]: %timeit lut[np.arange(planes).reshape(-1, 1), img]
1000 loops, best of 3: 471 us per loop
But if, instead of fancy indexing, you use take and a python loop over the planes, things can be sped up tremendously:
In [6]: %timeit for _ in (lut[j].take(img[j]) for j in xrange(planes)) : pass
10000 loops, best of 3: 59 us per loop
Can lut and img be rearranged in some way, so that the whole operation happens without python loops, but uses numpy.take (or an alternative method) instead of conventional fancy indexing to keep the speed advantage?
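A minimal setup sketch for the two snippets above (the concrete sizes here are arbitrary, just for reproducing the comparison):
import numpy as np
planes, rows = 16, 100000
rng = np.random.default_rng()
img = rng.integers(0, 256, size=(planes, rows), dtype=np.uint8)
lut = rng.integers(0, 256, size=(planes, 256), dtype=np.uint8)
out = lut[np.arange(planes).reshape(-1, 1), img]                    # fancy indexing
out_loop = np.stack([lut[j].take(img[j]) for j in range(planes)])   # take + loop
assert np.array_equal(out, out_loop)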
ORIGINAL QUESTION
I have a set of look-up tables (LUTs) that I want to use on an image. The array holding the LUTs is of shape (planes, 256, n), and the image has shape (planes, rows, cols). Both are of dtype = 'uint8', matching the 256 axis of the LUT. The idea is to run the p-th plane of the image through each of the n LUTs from the p-th plane of the LUT.
If my lut and img are the following:
planes, rows, cols, n = 3, 4000, 4000, 4
lut = np.random.randint(-2**31, 2**31 - 1,
                        size=(planes * 256 * n // 4,)).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31 - 1,
                        size=(planes * rows * cols // 4,)).view('uint8')
img = img.reshape(planes, rows, cols)
I can achieve what I am after using fancy indexing like this
out = lut[np.arange(planes).reshape(-1, 1, 1), img]
which gives me an array of shape (planes, rows, cols, n), where out[i, :, :, j] holds the i-th plane of img run through the j-th LUT of the i-th plane of the LUT...
All is good, except for this:
In [2]: %timeit lut[np.arange(planes).reshape(-1, 1, 1), img]
1 loops, best of 3: 5.65 s per loop
which is completely unacceptable, especially since all of the following not-so-nice-looking alternatives using np.take run much faster:
A single LUT on a single plane runs about x70 faster:
In [2]: %timeit np.take(lut[0, :, 0], img[0])
10 loops, best of 3: 78.5 ms per loop
A python loop running through all the desired combinations finishes almost x6 faster:
In [2]: %timeit for _ in (np.take(lut[j, :, k], img[j]) for j in xrange(planes) for k in xrange(n)) : pass
1 loops, best of 3: 947 ms per loop
Even running all combinations of planes in the LUT and image and then discarding the planes**2 - planes unwanted ones is faster than fancy indexing:
In [2]: %timeit np.take(lut, img, axis=1)[np.arange(planes), np.arange(planes)]
1 loops, best of 3: 3.79 s per loop
And the fastest combination I have been able to come up with has a python loop iterating over the planes and finishes x13 faster:
In [2]: %timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
1 loops, best of 3: 434 ms per loop
The question, of course, is whether there is a way of doing this with np.take without any python loop. Ideally, whatever reshaping or resizing is needed should happen on the LUT, not the image, but I am open to whatever you people can come up with...
First of all I have to say I really liked your question. Without rearranging LUT or IMG, the following solution worked:
%timeit a=np.take(lut, img, axis=1)
# 1 loops, best of 3: 1.93s per loop
But from the result you have to query the diagonal, a[0,0], a[1,1], a[2,2], to get what you want. I've tried to find a way to do this indexing only for the diagonal elements, but still have not managed.
Here are some ways to rearrange your LUT and IMG:
The following works if the indexes in IMG are from 0-255 for the 1st plane, 256-511 for the 2nd plane, and 512-767 for the 3rd plane, but that would prevent you from using 'uint8', which can be a big issue...:
lut2 = lut.reshape(-1,4)
%timeit np.take(lut2,img,axis=0)
# 1 loops, best of 3: 716 ms per loop
# or
%timeit np.take(lut2, img.flatten(), axis=0).reshape(3,4000,4000,4)
# 1 loops, best of 3: 709 ms per loop
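For completeness (a sketch that is not part of the original answer), the per-plane offsets described above can be built explicitly, at the cost of leaving uint8; this reuses the lut, img, planes and n from the question:
lut2 = lut.reshape(-1, n)                           # shape (planes*256, n)
offsets = (np.arange(planes) * 256)[:, None, None]  # 0, 256, 512 -- one offset per plane
idx = img.astype(np.intp) + offsets                 # indices no longer fit in uint8
out2 = np.take(lut2, idx, axis=0)                   # shape (planes, rows, cols, n)
assert np.array_equal(out2, lut[np.arange(planes).reshape(-1, 1, 1), img])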
On my machine your solution is still the best option, and very adequate since you just need the diagonal evaluations, i.e. plane1-plane1, plane2-plane2 and plane3-plane3:
%timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
# 1 loops, best of 3: 677 ms per loop
I hope this gives you some insight towards a better solution. It would be nice to look for more options with flatten() and methods such as np.apply_over_axes() or np.apply_along_axis(), which seem promising.
I used this code below to generate the data:
import numpy as np
num = 4000
planes, rows, cols, n = 3, num, num, 4
lut = np.random.randint(-2**31, 2**31-1,size=(planes*256*n//4,)).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31-1,size=(planes*rows*cols//4,)).view('uint8')
img = img.reshape(planes, rows, cols)
