Halide non-contiguous memory layout - python

Is it possible to use a non-C/Fortran ordering in Halide? Given dimensions x, y, c, I want x to vary the fastest and c to vary the second fastest (in numpy, at least, the strides would be .strides = (W*C, 1, W)). Our memory layout is a stack of images where the channels of each image are stacked by scanline.
(Sorry if the layout still isn't clear enough, I can try to clarify). Using the python bindings, I always get ValueError: ndarray is not contiguous when trying to pass in my numpy array with .strides set.
I've tried changing the numpy array to use contiguous strides (without changing the memory layout) just to get it into Halide, then setting .set_stride in halide, but no luck. I'm just wanting to make sure I'm not trying to do something that can't/shouldn't be done.
I think this is similar to the line-by-line layout mentioned at https://halide-lang.org/tutorials/tutorial_lesson_16_rgb_generate.html, except more dimensions in C since the images are "stacked" along channel (to produce a W, H, C*image_count tensor)
Any advice would be much appreciated.
Thanks!

This is more of a numpy question than a Halide one. The following Halide code illustrates use of an array in the shape you are looking for (I think):
import halide as hl
import numpy as np
x, y, c = hl.Var('x'), hl.Var('y'), hl.Var('c')
f = hl.Func('f')
f[x, y, c] = (x * 3) + (y * 12) + c
# This would be necessary for internally allocated buffers
# f.reorder_storage(x, c, y)
# These control output layout
f.output_buffer().dim(1).set_stride(12)
f.output_buffer().dim(2).set_stride(3)
# Probably wanted for efficiency
f.reorder(x, c, y)
result = f.realize(4, 5, 3)
print(result, result[0, 1, 1])
np_result = np.array(result)
print(np_result, np_result[0, 1, 1])
print(np_result.shape, " ", np_result.strides, " ", np_result.flags)
I'm not well versed in numpy and am not sure how you would allocate an array in that layout from scratch, but the answer might be something like np.lib.stride_tricks.as_strided.
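On the numpy side, one way to get an array in the layout from the question without as_strided is to allocate a C-contiguous buffer in storage order (y, c, x) and take a transposed view. This is only a minimal sketch: W, H and C are hypothetical sizes, and the thread does not confirm whether the Halide Python bindings accept such a non-contiguous view.

```python
import numpy as np

W, H, C = 4, 5, 3  # hypothetical sizes, chosen only for illustration

# Allocate in storage order: y slowest, then c, then x fastest.
storage = np.zeros((H, C, W), dtype=np.uint8)

# Reindex as (y, x, c) without copying; only the strides change.
view = storage.transpose(0, 2, 1)

print(view.shape)                  # (5, 4, 3)
print(view.strides)                # (12, 1, 4), i.e. (W*C, 1, W) for 1-byte elements
print(view.flags['C_CONTIGUOUS'])  # False
```

Because the view shares memory with storage, writing through either name updates the same scanline-interleaved buffer.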

Related

I have problems converting a (5,1,1,1) array to a (5,1) array, because the other dimensions have no purpose

I am running Python on my Raspberry Pi, so computation time matters, because I am designing an autonomous boat. For the moment I have solved it, but I think there must be an easier way. The code I have now is the following:
import numpy as np
a = np.array([[[[0]]],[[[0]]],[[[0]]],[[[0]]],[[[0]]]])
xtemp = []
for w in a:
    for x in w:
        for y in x:
            for z in y:
                xtemp.append(z)
xtemp = np.array(xtemp)
xtemp = np.reshape(xtemp, (xtemp.shape[0], 1))
print(xtemp)
print(xtemp.shape)
The reason for the np.array call is that I need an array and not a list. The reshape covers the problem of getting a (5,) shape instead of (5,1).
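The nested loops only drop axes of length 1, so a single reshape (or squeeze) should give the same result without any Python-level iteration. A minimal sketch, using an all-zero array of the same shape as in the question:

```python
import numpy as np

a = np.zeros((5, 1, 1, 1), dtype=int)

# Keep the first axis and collapse everything else into one column.
b = a.reshape(a.shape[0], 1)

# Equivalent here: drop the two trailing length-1 axes explicitly.
c = a.squeeze(axis=(2, 3))

print(b.shape)  # (5, 1)
print(c.shape)  # (5, 1)
```

For a contiguous input like this one, reshape returns a view rather than a copy, which keeps the cost negligible.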

How to resize images that have been converted to numpy arrays

Suppose we only have images as .npy files. Is it possible to resize the images without converting them back to image files? I'm looking for an approach that runs fast.
For more context: I have the original images, but I don't want to use them in the code, because my dataset is very large and working with image files is slow. I'm also not sure which size is best for my images. So I'm looking for a way to first convert the images to .npy, save the .npy file, and then preprocess it, for example by resizing the images.
Try PIL, maybe it's fast enough for you.
import numpy as np
from PIL import Image
arr = np.load('img.npy')
img = Image.fromarray(arr)
resized = img.resize((100, 100))  # resize() returns a new image; img itself is unchanged
Note that you have to compute the aspect ratio if you want to keep it. Or you can use Image.thumbnail(), which can take an antialias filter.
There's also scikit-image, but I suspect it's using PIL under the hood. It works on NumPy arrays:
import skimage.transform as st
st.resize(arr, (100, 100))
I guess the other option is OpenCV.
If you are only dealing with numpy arrays, I think slicing would be enough
Say, the shape of the loaded numpy array is (m, n) (one channel), and the target shape is (a, b). Then, the stride can be (s1, s2) = (m // a, n // b)
So the original array can be sliced by
new_array = old_array[::s1, ::s2]
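As a small illustration of the slicing approach, here is a sketch with made-up shapes. Note that this is plain subsampling with no interpolation or antialiasing, and that when the source size is not an exact multiple of the target size the slice can come out larger than the target, so the sketch crops afterwards (the crop is an addition, not part of the answer above).

```python
import numpy as np

# Exact multiple: (12, 8) -> (4, 2)
old_array = np.arange(12 * 8).reshape(12, 8)
a, b = 4, 2
s1, s2 = old_array.shape[0] // a, old_array.shape[1] // b
new_array = old_array[::s1, ::s2]
print(new_array.shape)  # (4, 2)

# Not an exact multiple: 10 rows with a target of 3 rows gives 4 rows
other = np.arange(10 * 8).reshape(10, 8)
a, b = 3, 2
s1, s2 = other.shape[0] // a, other.shape[1] // b
sliced = other[::s1, ::s2]
cropped = sliced[:a, :b]  # crop down to the requested shape
print(sliced.shape, cropped.shape)  # (4, 2) (3, 2)
```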
EDIT
Scaling an array up is also quite straightforward if you use index lists for advanced indexing. For example, the shape of the original array is (m, n), and the target shape is (a, b). Then, as an example
a, b = 300, 200
m, n = 3, 4
original = np.linspace(1, 12, 12).reshape(3, 4)
canvas = np.zeros((a, b))
(s1, s2) = (a // m, b // n) # the integer scale factors
# the two index lists: each source row/column index repeated s1 (or s2) times
mask_x = np.concatenate([np.ones(s1) * ind for ind in range(m)])
mask_y = np.concatenate([np.ones(s2) * ind for ind in range(n)])
# pad with the last index when the target size is not an exact multiple
if len(mask_x) < a: mask_x = np.concatenate([mask_x, np.ones(a - len(mask_x)) * (m - 1)])
if len(mask_y) < b: mask_y = np.concatenate([mask_y, np.ones(b - len(mask_y)) * (n - 1)])
# use a wide integer type; int8 would overflow for more than 127 rows or columns
mask_x = mask_x.astype(np.intp).tolist()
mask_y = mask_y.astype(np.intp).tolist()
canvas = original[mask_x, :]
canvas = canvas[:, mask_y]
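When the target shape is an exact integer multiple of the source shape, the same nearest-neighbour upscaling can probably be written more compactly with np.repeat. This is a sketch of that alternative rather than the method from the answer; the scale factors below are chosen to match the (3, 4) to (300, 200) example.

```python
import numpy as np

original = np.linspace(1, 12, 12).reshape(3, 4)
s1, s2 = 100, 50  # integer scale factors for rows and columns

# Repeat every row s1 times, then every column s2 times.
upscaled = np.repeat(np.repeat(original, s1, axis=0), s2, axis=1)

print(upscaled.shape)  # (300, 200)
```

Each source element becomes an s1-by-s2 block of identical values, so no smoothing is applied.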

Einsum slower than explicit Numpy implementation for n-mode tensor-matrix product

I'm trying to implement the n-mode tensor-matrix product (as defined by Kolda and Bader: https://www.sandia.gov/~tgkolda/pubs/pubfiles/SAND2007-6702.pdf) efficiently in Python using Numpy. The operation effectively gets down to (for matrix U, tensor X and axis/mode k):
1. Extract all vectors along axis k from X by collapsing all other axes.
2. Multiply these vectors on the left by U using standard matrix multiplication.
3. Insert the vectors again into the output tensor using the same shape, apart from X.shape[k], which is now equal to U.shape[0] (initially, X.shape[k] must be equal to U.shape[1], as a result of the matrix multiplication).
I've been using an explicit implementation for a while which performs all these steps separately:
1. Transpose the tensor to bring axis k to the front (in my full code I added an exception in case k == X.ndim - 1, in which case it's faster to leave it there and transpose all future operations, or at least in my application, but that's not relevant here).
2. Reshape the tensor to collapse all other axes.
3. Calculate the matrix multiplication.
4. Reshape the tensor to reconstruct all other axes.
5. Transpose the tensor back into the original order.
I would think this implementation creates a lot of unnecessary (big) arrays, so once I discovered np.einsum I thought this would speed things up considerably. However using the code below I got worse results:
import numpy as np
from time import time
def mode_k_product(U, X, mode):
    # Swap axis `mode` with axis 0; the swap is its own inverse
    transposition_order = list(range(X.ndim))
    transposition_order[mode] = 0
    transposition_order[0] = mode
    Y = np.transpose(X, transposition_order)
    transposed_ranks = list(Y.shape)
    # Collapse all other axes, multiply, then restore them
    Y = np.reshape(Y, (Y.shape[0], -1))
    Y = U @ Y
    transposed_ranks[0] = Y.shape[0]
    Y = np.reshape(Y, transposed_ranks)
    Y = np.transpose(Y, transposition_order)
    return Y
def einsum_product(U, X, mode):
    # Input labels for X: axis `mode` is the contracted index
    axes1 = list(range(X.ndim))
    axes1[mode] = X.ndim + 1
    # Output labels: axis `mode` is replaced by the row index of U
    axes2 = list(range(X.ndim))
    axes2[mode] = X.ndim
    return np.einsum(U, [X.ndim, X.ndim + 1], X, axes1, axes2, optimize=True)
def test_correctness():
    A = np.random.rand(3, 4, 5)
    for i in range(3):
        B = np.random.rand(6, A.shape[i])
        X = mode_k_product(B, A, i)
        Y = einsum_product(B, A, i)
        print(np.allclose(X, Y))
def test_time(method, amount):
    U = np.random.rand(256, 512)
    X = np.random.rand(512, 512, 256)
    start = time()
    for i in range(amount):
        method(U, X, 1)
    return (time() - start)/amount
def test_times():
    print("Explicit:", test_time(mode_k_product, 10))
    print("Einsum:", test_time(einsum_product, 10))
test_correctness()
test_times()
Timings for me:
Explicit: 3.9450525522232054
Einsum: 15.873924326896667
Is this normal or am I doing something wrong? I know there are circumstances where storing intermediate results can decrease complexity (e.g. chained matrix multiplication), however in this case I can't think of any calculations that are being repeated. Is matrix multiplication so optimized that it removes the benefits of not transposing (which technically has a lower complexity)?
I'm more familiar with the subscripts style of using einsum, so worked out these equivalences:
In [194]: np.allclose(np.einsum('ij,jkl->ikl',B0,A), einsum_product(B0,A,0))
Out[194]: True
In [195]: np.allclose(np.einsum('ij,kjl->kil',B1,A), einsum_product(B1,A,1))
Out[195]: True
In [196]: np.allclose(np.einsum('ij,klj->kli',B2,A), einsum_product(B2,A,2))
Out[196]: True
With a mode parameter, your approach in einsum_product may be best. But the equivalences help me visualize the calculation better, and may help others.
Timings should basically be the same. There's an extra setup time in einsum_product that should disappear in larger dimensions.
After updating Numpy, Einsum is only slightly slower than the explicit method, with or without multi-threading (see comments to my question).
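A third formulation that may be worth timing alongside the two above is np.tensordot followed by np.moveaxis. The sketch below only checks that it agrees with the subscript-style einsum calls from the answer; tensordot_product is a hypothetical helper name, and no timing claim is made here.

```python
import numpy as np

def tensordot_product(U, X, mode):
    # Contract axis 1 of U with axis `mode` of X; the new axis comes out first.
    Y = np.tensordot(U, X, axes=(1, mode))
    # Move the new axis back to position `mode`.
    return np.moveaxis(Y, 0, mode)

rng = np.random.default_rng(0)
X = rng.random((3, 4, 5))
subscripts = ['ij,jkl->ikl', 'ij,kjl->kil', 'ij,klj->kli']

all_match = True
for mode, subs in enumerate(subscripts):
    U = rng.random((6, X.shape[mode]))
    expected = np.einsum(subs, U, X)
    all_match = all_match and np.allclose(tensordot_product(U, X, mode), expected)

print(all_match)  # True
```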

numpy einsum with '...'

The code below is meant to conduct a linear coordinate transformation on a set of 3d coordinates. The transformation matrix is A, and the array containing the coordinates is x. The zeroth axis of x runs over the dimensions x, y, z. It can have any arbitrary shape beyond that.
Here's my attempt:
A = np.random.random((3, 3))
x = np.random.random((3, 4, 2))
x_prime = np.einsum('ij,j...->i...', A, x)
The output is:
x_prime = np.einsum('ij,j...->i...', A, x)
ValueError: operand 0 did not have enough dimensions
to match the broadcasting, and couldn't be extended
because einstein sum subscripts were specified at both
the start and end
If I specify the additional subscripts in x explicitly, the error goes away. In other words, the following works:
x_prime = np.einsum('ij,jkl->ikl', A, x)
I'd like x to be able to have any arbitrary number of axes after the zeroth axis, so the workaround I give above is not optimal. I'm actually not sure why the first einsum example is not working. I'm using numpy 1.6.1. Is this a bug, or am I misunderstanding the documentation?
Yep, it's a bug. It was fixed in this pull request: https://github.com/numpy/numpy/pull/4099
This was only merged a month ago, so it'll be a while before it makes it to a stable release.
EDIT: As @hpaulj mentions in the comment, you can work around this limitation by adding an ellipsis even when all indices are specified:
np.einsum('...ij,j...->i...', A, x)
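On a NumPy release that includes the fix, the original subscripts should work for any number of trailing axes. The following sketch checks this against np.tensordot, which contracts the same pair of axes; the shapes are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3))

shapes = [(3,), (3, 4), (3, 4, 2)]
results = []
for shape in shapes:
    x = rng.random(shape)
    x_prime = np.einsum('ij,j...->i...', A, x)
    # Reference: contract axis 1 of A with axis 0 of x.
    reference = np.tensordot(A, x, axes=(1, 0))
    results.append((x_prime.shape, np.allclose(x_prime, reference)))

for shape, ok in results:
    print(shape, ok)
```

The output shape equals the input shape in each case, since A is square and only the zeroth axis is transformed.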

root mean square in numpy and complications of matrix and arrays of numpy

Can anyone direct me to the section of the numpy manual that covers functions for root mean square calculations?
(I know this can be accomplished using np.mean and np.abs. Isn't there a built-in? If not, why? Just curious, no offense.)
Can anyone also explain the complications of matrices versus arrays, just in the following case?
U is a matrix (T-by-N), and Ue is another matrix (T-by-N).
U[ind,:] is still a matrix, so I define k as a numpy array in the following fashion:
k = np.array(U[ind,:])
When I print k or type k in ipython, it displays the following:
K = array ([[2,.3 .....
......
9]])
You see the double square brackets (which make it multi-dimensional, I guess), which give it shape = (1,N).
But I can't assign it to an array defined in this way:
l = np.zeros(N)
which has shape (N,).
l[:] = k[:]
error:
matrix dimensions incompatible
Is there a way to accomplish the vector assignment I intend to do? Please don't tell me to do l = k (that defeats the purpose; I get different errors in the program, and I know the reasons. If you need, I can attach the piece of code).
Writing a loop is the dumb way, which I'm using for the time being.
I hope I was able to explain the problems I'm facing.
Regards.
For the RMS, I think this is the clearest:
from numpy import mean, sqrt, square, arange
a = arange(10) # For example
rms = sqrt(mean(square(a)))
The code reads like you say it: "root-mean-square".
For rms, the fastest expression I have found for small x.size (~ 1024) and real x is:
def rms(x):
    return np.sqrt(x.dot(x)/x.size)
This seems to be around twice as fast as the linalg.norm version (ipython %timeit on a really old laptop).
If you want complex arrays handled more appropriately then this also would work:
def rms(x):
    return np.sqrt(np.vdot(x, x)/x.size)
However, this version is nearly as slow as the norm version and only works for flat arrays.
For the RMS, how about
norm(V)/sqrt(V.size)
I don't know why it's not built in. I like
def rms(x, axis=None):
    return sqrt(mean(x**2, axis=axis))
If you have nans in your data, you can do
def nanrms(x, axis=None):
    return sqrt(nanmean(x**2, axis=axis))
Try this:
U = np.zeros((N,N))
ind = 1
k = np.zeros(N)
k[:] = U[ind,:]
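For the case in the question, where U is an np.matrix rather than a plain array, one option is to convert the row to a 1-D array before assigning. A minimal sketch with made-up values (N and the contents of U are assumptions for illustration):

```python
import numpy as np

N = 4
U = np.matrix(np.arange(12.0).reshape(3, N))  # a 3-by-N matrix
ind = 1

row = U[ind, :]              # still a matrix, shape (1, N)
k = np.asarray(row).ravel()  # plain 1-D array, shape (N,)

l = np.zeros(N)
l[:] = k                     # shapes now match exactly

print(row.shape, k.shape)    # (1, 4) (4,)
print(l)                     # [4. 5. 6. 7.]
```

Indexing a matrix always yields a 2-D result, which is why the explicit conversion is needed before a shape of (N,) appears.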
I use this for RMS, all using NumPy, and let it also have an optional axis similar to other NumPy functions:
import numpy as np
rms = lambda V, axis=None: np.sqrt(np.mean(np.square(V), axis))
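The RMS formulations in the answers above should agree for real input, which the sketch below checks on a small example. The abs() call in this version is an addition so that complex input gives the RMS of the magnitudes; squaring a complex array directly would not.

```python
import numpy as np

def rms(x, axis=None):
    # abs() makes this valid for complex input as well
    return np.sqrt(np.mean(np.abs(x) ** 2, axis=axis))

x = np.array([3.0, 4.0])
via_mean = rms(x)                               # sqrt((9 + 16) / 2)
via_norm = np.linalg.norm(x) / np.sqrt(x.size)
via_dot = np.sqrt(x.dot(x) / x.size)
print(via_mean, via_norm, via_dot)              # all approximately 3.5355

# Complex input: |3+4j| = 5, so the RMS is sqrt((25 + 0) / 2)
z = np.array([3 + 4j, 0j])
via_vdot = np.sqrt(np.vdot(z, z).real / z.size)
print(rms(z), via_vdot)                         # both approximately 3.5355

# Per-row RMS using the axis argument
rows = rms(np.array([[3.0, 4.0], [0.0, 0.0]]), axis=1)
print(rows)                                     # approximately [3.5355, 0.0]
```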
If you have complex vectors and are using pytorch, the vector norm is the fastest approach on CPU & GPU:
import torch
batch_size, length = 512, 4096
batch = torch.randn(batch_size, length, dtype=torch.complex64)
scale = 1 / torch.sqrt(torch.tensor(length))
rms_power = batch.norm(p=2, dim=-1, keepdim=True)
batch_rms = batch / (rms_power * scale)
Using batch vdot like goodboy's approach is 60% slower than above. Using naïve method similar to deprecated's approach is 85% slower than above.
