Related
Let's say there are two arrays I and J which determine the neighbor pairs:
I = np.array([0, 0, 1, 2, 2, 3])
J = np.array([1, 2, 0, 0, 3, 2])
This means element 0 has two neighbors, 1 and 2; element 1 has only 0 as a neighbor; and so on.
What is the most efficient way to create arrays of all neighbor triples I', J', K' such that j is a neighbor of i and k is a neighbor of j, with i, j, and k all being different elements (i != j != k)?
Ip = np.array([0, 1, 2, 3])
Jp = np.array([2, 0, 0, 2])
Kp = np.array([3, 2, 1, 0])
Of course, one way is to loop over each element. Is there a more efficient algorithm? (working with 10-500 million elements)
I would go with a very simple approach and use pandas (I and J are your numpy arrays):
import pandas as pd
df1 = pd.DataFrame({'I': I, 'J': J})
df2 = df1.rename(columns={'I': 'K', 'J': 'I'})
result = pd.merge(df2, df1, on='I').query('K != J')
The advantage is that pandas.merge relies on a very fast underlying numerical implementation, and you can make the computation even faster, for example by merging on indexes.
To reduce the memory this approach needs, it would probably help a lot to shrink df1 and df2 before merging them (for example, by changing the dtype of their columns to something that suits your needs).
Here is an example of how to optimize speed and memory of the computation:
from timeit import timeit
import numpy as np
import pandas as pd
I = np.random.randint(0, 10000, 1000000)
J = np.random.randint(0, 10000, 1000000)
df1_64 = pd.DataFrame({'I': I, 'J': J})
df1_32 = df1_64.astype('int32')
df2_64 = df1_64.rename(columns={'I': 'K', 'J': 'I'})
df2_32 = df1_32.rename(columns={'I': 'K', 'J': 'I'})
timeit(lambda: pd.merge(df2_64, df1_64, on='I').query('K != J'), number=1)
# 18.84
timeit(lambda: pd.merge(df2_32, df1_32, on='I').query('K != J'), number=1)
# 9.28
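As a sketch of the "merging on indexes" idea mentioned above (df1_32 and df2_32 are from the example; whether this actually helps should be measured on your data):
df1_idx = df1_32.set_index('I')
df2_idx = df2_32.set_index('I')
result = df2_idx.join(df1_idx, how='inner').query('K != J').reset_index()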
There is no particularly magic algorithm for generating all of the triples. You can avoid re-fetching a node's neighbors by doing an orderly search, but that's about it.
Make an empty list, N, of nodes to check.
Add some start node, S, to N.
While N is not empty:
    Pop a node off the list; call it A.
    Make a set of its neighbors, A'.
    For each neighbor B of A:
        For each element a of A':
            Generate the triple (a, A, B).
        Add B to the list of nodes to check, if it has not already been checked.
Does that help? There are still several details to handle in the algorithm above, such as avoiding duplicate generation, and fine points of moving through cliques.
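A rough Python sketch of that idea (illustration only: it does not yet handle the duplicate and clique subtleties mentioned above, it covers one connected component per start node, and it assumes the adjacency is available as a dict mapping each node to a set of neighbors):
from collections import deque

def triples_from(adj, start):
    # adj: dict mapping node -> set of neighbor nodes
    seen = {start}
    to_check = deque([start])
    out = []
    while to_check:
        A = to_check.popleft()        # pop a node off the list
        nbrs = adj[A]                 # the set A' of its neighbors
        for B in nbrs:
            for a in nbrs:
                if a != B:            # keep the two endpoints distinct
                    out.append((a, A, B))
            if B not in seen:         # add B if it has not been checked yet
                seen.add(B)
                to_check.append(B)
    return out

# adj = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
# triples_from(adj, 0) -> [(2, 0, 1), (1, 0, 2), (3, 2, 0), (0, 2, 3)]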
What you are looking for is all paths of length 3 in the graph. You can achieve this simply with the following recursive algorithm:
import networkx as nx
def findPaths(G, u, n):
    """Returns a list of all paths of length `n` starting at vertex `u`."""
    if n == 1:
        return [[u]]
    paths = [[u] + path
             for neighbor in G.neighbors(u)
             for path in findPaths(G, neighbor, n - 1)
             if u not in path]
    return paths
# Generating graph
vertices = np.unique(I)
edges = list(zip(I,J))
G = nx.Graph()
G.add_edges_from(edges)
# Grabbing all 3-paths
paths = [path for v in vertices for path in findPaths(G,v,3)]
paths
>>> [[0, 2, 3], [1, 0, 2], [2, 0, 1], [3, 2, 0]]
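If you need the result back in the Ip/Jp/Kp array form of the question, one way (just a sketch) is to transpose the list of paths:
Ip, Jp, Kp = np.array(paths).T
# Ip -> array([0, 1, 2, 3]), Jp -> array([2, 0, 0, 2]), Kp -> array([3, 2, 1, 0])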
This is an initial solution to your problem using networkx, an optimized library for graph computations:
import numpy as np
import networkx as nx
I = np.array([0, 0, 1, 2, 2, 3])
J = np.array([1, 2, 0, 0, 3, 2])
I_, J_, K_ = [], [], [],
num_nodes = np.max(np.concatenate([I,J])) + 1
A = np.zeros((num_nodes, num_nodes))
A[I,J] = 1
print("Adjacency Matrix:")
print(A)
G = nx.from_numpy_matrix(A)
for i in range(num_nodes):
    first_neighbors = list(G.neighbors(i))
    for j in first_neighbors:
        second_neighbor = list(G.neighbors(j))
        second_neighbor_no_circle = list(filter(lambda node: node != i, second_neighbor))
        num_second_neighbors = len(second_neighbor_no_circle)
        if num_second_neighbors > 0:
            I_.extend(num_second_neighbors * [i])
            J_.extend(num_second_neighbors * [j])
            K_.extend(second_neighbor_no_circle)
I_, J_, K_ = np.array(I_), np.array(J_), np.array(K_)
print("result:")
print(I_)
print(J_)
print(K_)
####### Output #######
Adjacency Matrix:
[[0. 1. 1. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 1.]
[0. 0. 1. 0.]]
result:
[0 1 2 3]
[2 0 0 2]
[3 2 1 0]
I used %%timeit on the code above without print statements to check the running time:
49 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Complexity analysis:
Finding all the neighbors of all the neighbors is essentially taking 2 steps in a Depth First Search algorithm. This could take, depending on the graph's topology, up to O(|V| + |E|) where |E| is the number of edges in the graph and |V| is the number of vertices.
To the best of my knowledge, there is no better algorithm on a general graph.
However, if you do know some special properties about the graph, the running time could be more tightly bounded or perhaps alter the current algorithm based on this knowledge.
For instance, if you know every vertex has at most d edges, the two-step neighbor lookup from each starting vertex touches at most d first neighbors and d*d second neighbors, which is much better than O(|V| + |E|) when d is small compared to |E|.
Let me know if you have any questions.
I have an n-by-3 index array (think of triangles indexing points) and a list of float values associated with the triangles. I now want to get for each index ("point") the minimum value, i.e., check all rows which contain the index, say, 0, and get the minimum value from vals across the respective rows:
import numpy
a = numpy.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = numpy.array([0.1, 0.5, 0.3, 0.6])
out = [
numpy.min(vals[numpy.any(a == i, axis=1)])
for i in range(6)
]
# out = numpy.array([0.1, 0.1, 0.1, 0.5, 0.3, 0.6])
This solution is inefficient because it does a full array comparison for every i.
This looks like a job for one of numpy's ufunc .at methods, but numpy.min.at doesn't exist.
Any hints?
Approach #1
One approach is based on array assignment: set up a 2D array filled with NaNs, use the values of a as column indices (so they are assumed to be integers), map vals into it, and take the NaN-skipping minimum along each column for the final output -
nr,nc = len(a),a.max()+1
m = np.full((nr,nc),np.nan)
m[np.arange(nr)[:,None],a] = vals[:,None]
out = np.nanmin(m,axis=0)
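On the question's sample arrays this reproduces the expected output (a quick check only):
import numpy as np

a = np.array([[0, 1, 2],
              [2, 3, 0],
              [1, 4, 2],
              [2, 5, 3]])
vals = np.array([0.1, 0.5, 0.3, 0.6])

nr, nc = len(a), a.max() + 1
m = np.full((nr, nc), np.nan)
m[np.arange(nr)[:, None], a] = vals[:, None]
print(np.nanmin(m, axis=0))   # [0.1 0.1 0.1 0.5 0.3 0.6]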
Approach #2
Another one, again based on array assignment, but using masking and np.minimum.reduceat instead of having to deal with NaNs -
nr,nc = len(a),a.max()+1
m = np.zeros((nc,nr),dtype=bool)
m[a.T,np.arange(nr)] = 1
c = m.sum(1)
shift_idx = np.r_[0,c[:-1].cumsum()]
out = np.minimum.reduceat(np.broadcast_to(vals,m.shape)[m],shift_idx)
Approach #3
Another one, based on argsort (assuming a contains all the integers from 0 to a.max()) -
sidx = a.ravel().argsort()
c = np.bincount(a.ravel())
out = np.minimum.reduceat(vals[sidx//a.shape[1]],np.r_[0,c[:-1].cumsum()])
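The sidx // a.shape[1] step works because integer-dividing a flat (raveled) position by the row length recovers the row it came from, so vals[sidx // a.shape[1]] lists, label by label, the value of every row containing that label. A quick check on the sample data (a and vals as in the question, np for numpy):
sidx = a.ravel().argsort()
print(a.ravel()[sidx])        # labels in sorted order: [0 0 1 1 2 2 2 2 3 3 4 5]
print(sidx // a.shape[1])     # the row each occurrence came from
c = np.bincount(a.ravel())
print(np.minimum.reduceat(vals[sidx // a.shape[1]], np.r_[0, c[:-1].cumsum()]))
# [0.1 0.1 0.1 0.5 0.3 0.6]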
Approach #4
For memory efficiency, and hence performance, and also to complete the set, here is a numba-based one -
from numba import njit

@njit
def numba1(a, vals, out):
    m, n = a.shape
    for j in range(m):
        for i in range(n):
            e = a[j, i]
            if vals[j] < out[e]:
                out[e] = vals[j]
    return out

def func1(a, vals, outlen=None):  # feed in output length as outlen if known
    if outlen is not None:
        N = outlen
    else:
        N = a.max() + 1
    out = np.full(N, np.inf)
    return numba1(a, vals, out)
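Usage on the sample arrays would look like this (assuming numba is installed and a, vals, func1 are as above; outlen=6 is optional here since a.max()+1 already gives 6):
out = func1(a, vals)
# out -> array([0.1, 0.1, 0.1, 0.5, 0.3, 0.6])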
You may switch to pd.GroupBy or itertools.groupby if your for loop goes way beyond 6.
For instance,
r = a.ravel()
pd.Series(np.arange(len(r)) // 3).groupby(r).apply(lambda s: vals[s].min())
This solution would be faster for long loops, and probably slower for small loops (< 50)
Here is one based on this Q&A:
If you have pythran, compile
file <stb_pthr.py>
import numpy as np

#pythran export sort_to_bins(int[:], int)

def sort_to_bins(idx, mx):
    if mx == -1:
        mx = idx.max() + 1
    cnts = np.zeros(mx + 2, int)
    for i in range(idx.size):
        cnts[idx[i] + 2] += 1
    for i in range(2, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i] + 1]] = i
        cnts[idx[i] + 1] += 1
    return res, cnts[:-1]
Otherwise the script will fall back to a sparse matrix based approach which is only slightly slower:
import numpy as np

try:
    from stb_pthr import sort_to_bins
    HAVE_PYTHRAN = True
except ImportError:
    HAVE_PYTHRAN = False

from scipy.sparse import csr_matrix

def sort_to_bins_sparse(idx, mx):
    if mx == -1:
        mx = idx.max() + 1
    aux = csr_matrix((np.ones_like(idx), idx, np.arange(idx.size+1)),
                     (idx.size, mx)).tocsc()
    return aux.indices, aux.indptr

if not HAVE_PYTHRAN:
    sort_to_bins = sort_to_bins_sparse

def f_op():
    mx = a.max() + 1
    return np.fromiter((np.min(vals[np.any(a == i, axis=1)])
                        for i in range(mx)), vals.dtype, mx)

def f_pp():
    idx, bb = sort_to_bins(a.reshape(-1), -1)
    res = np.minimum.reduceat(vals[idx//3], bb[:-1])
    res[bb[:-1] == bb[1:]] = np.inf
    return res

def f_div_3():
    sidx = a.ravel().argsort()
    c = np.bincount(a.ravel())
    bb = np.r_[0, c.cumsum()]
    res = np.minimum.reduceat(vals[sidx//a.shape[1]], bb[:-1])
    res[bb[:-1] == bb[1:]] = np.inf
    return res
a = np.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = np.array([0.1, 0.5, 0.3, 0.6])
assert np.all(f_op()==f_pp())
from timeit import timeit
a = np.random.randint(0,1000,(10000,3))
vals = np.random.random(10000)
assert len(np.unique(a))==1000
assert np.all(f_op()==f_pp())
print("1000/1000 labels, 10000 rows")
print("op ", timeit(f_op, number=10)*100, 'ms')
print("pp ", timeit(f_pp, number=100)*10, 'ms')
print("div", timeit(f_div_3, number=100)*10, 'ms')
a = 1 + 2 * np.random.randint(0,5000,(1000000,3))
vals = np.random.random(1000000)
nl = len(np.unique(a))
assert np.all(f_div_3()==f_pp())
print(f"{nl}/{a.max()+1} labels, 1000000 rows")
print("pp ", timeit(f_pp, number=10)*100, 'ms')
print("div", timeit(f_div_3, number=10)*100, 'ms')
a = 1 + 2 * np.random.randint(0,100000,(1000000,3))
vals = np.random.random(1000000)
nl = len(np.unique(a))
assert np.all(f_div_3()==f_pp())
print(f"{nl}/{a.max()+1} labels, 1000000 rows")
print("pp ", timeit(f_pp, number=10)*100, 'ms')
print("div", timeit(f_div_3, number=10)*100, 'ms')
Sample run (timings include @Divakar's approach 3 for reference):
1000/1000 labels, 10000 rows
op 145.1122640981339 ms
pp 0.7944229000713676 ms
div 2.2905819199513644 ms
5000/10000 labels, 1000000 rows
pp 113.86540920939296 ms
div 417.2476712032221 ms
100000/200000 labels, 1000000 rows
pp 158.23634970001876 ms
div 486.13436080049723 ms
UPDATE: @Divakar's latest (approach 4) is hard to beat, being essentially a C implementation. Nothing wrong with that, except that jitting is not an option but a requirement here (the unjitted code is no fun to run). If one accepts that, the same can, of course, be done with pythran:
pythran -O3 labeled_min.py
file <labeled_min.py>
import numpy as np

#pythran export labeled_min(int[:,:], float[:])

def labeled_min(A, vals):
    mn = np.empty(A.max() + 1)
    mn[:] = np.inf
    M, N = A.shape
    for i in range(M):
        v = vals[i]
        for j in range(N):
            c = A[i, j]
            if v < mn[c]:
                mn[c] = v
    return mn
Both give another massive speedup:
from labeled_min import labeled_min

func1(a, vals)  # warm up first so we do not measure numba's jitting time
print("nmb ", timeit(lambda: func1(a, vals), number=100)*10, 'ms')
print("pthr", timeit(lambda: labeled_min(a, vals), number=100)*10, 'ms')
Sample run:
nmb 8.41792532010004 ms
pthr 8.104007659712806 ms
pythran comes out a few percent faster, but only because I moved the vals lookup out of the inner loop; without that they are all but equal.
For comparison, the previous best, with and without non-Python helpers, on the same problem:
pp 114.04887529788539 ms
pp (py only) 147.0821460010484 ms
Apparently, numpy.minimum.at exists:
import numpy
a = numpy.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = numpy.array([0.1, 0.5, 0.3, 0.6])
out = numpy.full(6, numpy.inf)
numpy.minimum.at(out, a.reshape(-1), numpy.repeat(vals, 3))
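A slightly more general form of the same call, in case the index array is not n-by-3 (a sketch; a and vals as above):
out = numpy.full(a.max() + 1, numpy.inf)
numpy.minimum.at(out, a.ravel(), numpy.repeat(vals, a.shape[1]))
# out -> array([0.1, 0.1, 0.1, 0.5, 0.3, 0.6])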
I have an array a of length N and need to implement the following operation:
b[n] = sum_{i=0}^{n} p^{n-i} * a[i], with p in [0..1]
This is a lossy sum, where the early indices in the sum are weighted by a greater loss (p^{n-i}) than the later ones; the last index (i = n) is always weighted by 1. If p = 1, the operation is a simple cumsum.
b = np.cumsum(a)
If p != 1, I can implement this operation in a CPU-inefficient way:
b = np.empty(np.shape(a))
# I'm using the (-1,-1,-1) idiom for reversed ranges
p_vec = np.power(p, np.arange(N-1, 0-1, -1))
# p_vec[0] = p^{N-1}, p_vec[-1] = 1
for n in range(N):
    b[n] = np.sum(a[:n+1]*p_vec[-(n+1):])
Or in a memory-inefficient but vectorized way (in my opinion it is CPU-inefficient too, since a lot of work is wasted):
a_idx = np.reshape(np.arange(N+1), (1, N+1)) - np.reshape(np.arange(N-1, 0-1, -1), (N, 1))
a_idx = np.maximum(0, a_idx)
# For N=4, a_idx looks like this:
# [[0, 0, 0, 0, 1],
# [0, 0, 0, 1, 2],
# [0, 0, 1, 2, 3],
# [0, 1, 2, 3, 4]]
a_ext = np.concatenate(([0], a,), axis=0) # len(a_ext) = N + 1
p_vec = np.power(p, np.arange(N, 0-1, -1)) # len(p_vec) = N + 1
b = np.dot(a_ext[a_idx], p_vec)
Is there a better way to achieve this 'lossy' cumsum?
What you want is an IIR filter; you can use scipy.signal.lfilter(). Here is the code:
Your code:
import numpy as np
N = 10
p = 0.8
np.random.seed(0)
x = np.random.randn(N)
y = np.empty_like(x)
p_vec = np.power(p, np.arange(N-1, 0-1, -1))
for n in range(N):
    y[n] = np.sum(x[:n+1]*p_vec[-(n+1):])
y
the output:
array([1.76405235, 1.81139909, 2.42785725, 4.183179 , 5.21410119,
3.19400307, 3.50529088, 2.65287549, 2.01908154, 2.02586374])
By using lfilter():
from scipy import signal
y = signal.lfilter([1], [1, -p], x)
print(y)
the output:
array([1.76405235, 1.81139909, 2.42785725, 4.183179 , 5.21410119,
3.19400307, 3.50529088, 2.65287549, 2.01908154, 2.02586374])
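The reason an IIR filter applies here is that the sum obeys the recurrence b[n] = p*b[n-1] + a[n], which is exactly what lfilter([1], [1, -p], x) evaluates. A minimal pure-Python check of that recurrence (a sketch, reusing x, p, N and signal from above):
y_rec = np.empty_like(x)
acc = 0.0
for n in range(N):
    acc = p * acc + x[n]      # b[n] = p*b[n-1] + x[n]
    y_rec[n] = acc
print(np.allclose(y_rec, signal.lfilter([1], [1, -p], x)))   # True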
I have recently hit a roadblock when it comes to performance. I know how to manually loop and do the interpolation from the origin cell to all the other cells by brute-forcing/looping over each row and column of the 2D array.
However, when I process a 2D array of shape, say, (3000, 3000), the linear spacing and the interpolation come to a standstill and severely hurt performance.
I am looking for a way to optimize this loop. I am aware of vectorization and broadcasting, I am just not sure how to apply them in this situation.
I will explain it with code and figures
import numpy as np
from scipy.ndimage import map_coordinates
m = np.array([
[10,10,10,10,10,10],
[9,9,9,10,9,9],
[9,8,9,10,8,9],
[9,7,8,0,8,9],
[8,7,7,8,8,9],
[5,6,7,7,6,7]])
origin_row = 3
origin_col = 3
m_max = np.zeros(m.shape)
m_dist = np.zeros(m.shape)
rows, cols = m.shape
for col in range(cols):
    for row in range(rows):
        # Get spacing linear interpolation
        x_plot = np.linspace(col, origin_col, 5)
        y_plot = np.linspace(row, origin_row, 5)

        # grab the interpolated line
        interpolated_line = map_coordinates(m,
                                            np.vstack((y_plot, x_plot)),
                                            order=1, mode='nearest')
        m_max[row][col] = max(interpolated_line)
        m_dist[row][col] = np.argmax(interpolated_line)
print(m)
print(m_max)
print(m_dist)
As you can see this is very brute force; I have managed to broadcast all the code around this part but am stuck on it.
Here is an illustration of what I am trying to achieve; I will go through the first iteration:
1.) the input array
2.) the first loop, from (0, 0) to the origin (3, 3)
3.) this will return [10 9 9 8 0]; the max will be 10 and the index will be 0
5.) here is the output for the sample array I used
Here is an update of the performance based on the accepted answer.
To speed up the code, you could first create the x_plot and y_plot outside of the loops instead of creating them several times each one:
#this would be outside of the loops
num = 5
lin_col = np.array([np.linspace(i, origin_col, num) for i in range(cols)])
lin_row = np.array([np.linspace(i, origin_row, num) for i in range(rows)])
then you could access them in each loop by x_plot = lin_col[col] and y_plot = lin_row[row]
Second, you can avoid both loops by calling map_coordinates once on more than just one vstack per (row, col) pair. To do so, you can create all the combinations of x_plot and y_plot using np.tile and np.ravel, such as:
arr_vs = np.vstack(( np.tile( lin_row, cols).ravel(),
np.tile( lin_col.ravel(), rows)))
Note that ravel is not used at the same place each time, in order to get all the combinations. Now you can use map_coordinates with this arr_vs and reshape the result with the number of rows, cols and num to get each interpolated_line in the last axis of a 3D array:
arr_map = map_coordinates(m, arr_vs, order=1, mode='nearest').reshape(rows,cols,num)
Finally, you can use np.max and np.argmax on the last axis of arr_map to get the results m_max and m_dist. So all the code would be:
import numpy as np
from scipy.ndimage import map_coordinates
m = np.array([
[10,10,10,10,10,10],
[9,9,9,10,9,9],
[9,8,9,10,8,9],
[9,7,8,0,8,9],
[8,7,7,8,8,9],
[5,6,7,7,6,7]])
origin_row = 3
origin_col = 3
rows, cols = m.shape
num = 5
lin_col = np.array([np.linspace(i, origin_col, num) for i in range(cols)])
lin_row = np.array([np.linspace(i, origin_row, num) for i in range(rows)])
arr_vs = np.vstack(( np.tile( lin_row, cols).ravel(),
np.tile( lin_col.ravel(), rows)))
arr_map = map_coordinates(m, arr_vs, order=1, mode='nearest').reshape(rows,cols,num)
m_max = np.max( arr_map, axis=-1)
m_dist = np.argmax( arr_map, axis=-1)
print (m_max)
print (m_dist)
and you get, as expected:
#m_max
array([[10, 10, 10, 10, 10, 10],
[ 9, 9, 10, 10, 9, 9],
[ 9, 9, 9, 10, 8, 9],
[ 9, 8, 8, 0, 8, 9],
[ 8, 8, 7, 8, 8, 9],
[ 7, 7, 8, 8, 8, 8]])
#m_dist
array([[0, 0, 0, 0, 0, 0],
[0, 0, 2, 0, 0, 0],
[0, 2, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0],
[0, 2, 0, 0, 0, 0],
[1, 1, 2, 1, 2, 1]])
EDIT: lin_col and lin_row are related, so you can build them faster:
if cols >= rows:
    arr = np.arange(cols)[:,None]
    lin_col = arr + (origin_col-arr)/(num-1.)*np.arange(num)
    lin_row = lin_col[:rows] + np.linspace(0, origin_row - origin_col, num)[None,:]
else:
    arr = np.arange(rows)[:,None]
    lin_row = arr + (origin_row-arr)/(num-1.)*np.arange(num)
    lin_col = lin_row[:cols] + np.linspace(0, origin_col - origin_row, num)[None,:]
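A quick way to convince yourself the closed form matches the original linspace construction (a check only, using the same rows, cols, origin_row, origin_col and num as above):
lin_col_ref = np.array([np.linspace(i, origin_col, num) for i in range(cols)])
lin_row_ref = np.array([np.linspace(i, origin_row, num) for i in range(rows)])
print(np.allclose(lin_col, lin_col_ref), np.allclose(lin_row, lin_row_ref))   # True True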
Here is a sort-of-vectorized approach. It is not very optimized and there may be one or two index-off-by-one errors, but it may give you ideas.
Two examples: a monochrome 384x512 test pattern and a "real" 3-channel 768x1024 image, both uint8.
This takes half a minute on my machine.
For larger images one would need more RAM than I have (8GB), or one would have to break the work into smaller chunks.
And the code
import numpy as np
def rays(img, ctr):
    M, N, *d = img.shape
    aidx = 2*(slice(None),) + (img.ndim-2)*(None,)
    m, n = ctr
    out = np.empty_like(img)
    offsI = np.empty(img.shape, np.uint16)
    offsJ = np.empty(img.shape, np.uint16)
    img4, out4, I4, J4 = ((x[m:, n:], x[m:, n::-1], x[m::-1, n:], x[m::-1, n::-1])
                          for x in (img, out, offsI, offsJ))
    for i, o, y, x in zip(img4, out4, I4, J4):
        for _ in range(2):
            M, N, *d = i.shape
            widths = np.arange(1, M+1, dtype=np.uint16).clip(None, N)
            I = np.arange(M, dtype=np.uint16).repeat(widths)
            J = np.ones_like(I)
            J[0] = 0
            J[widths[:-1].cumsum()] -= widths[:-1]
            J = J.cumsum(dtype=np.uint16)
            ii = np.arange(1, 2*M-1, dtype=np.uint16) // 2
            II = ii.clip(None, I[:, None])
            jj = np.arange(2*M-2, dtype=np.uint32) // 2 * 2 + 1
            jj[0] = 0
            JJ = ((1 + jj) * J[:, None] // (2*(I+1))[:, None]).astype(np.uint16).clip(None, J[:, None])
            idx = i[II, JJ].argmax(axis=1)
            II, JJ = (np.take_along_axis(ZZ[aidx], idx[:, None], 1)[:, 0] for ZZ in (II, JJ))
            y[I, J], x[I, J] = II, JJ
            SH = II, JJ, *np.ogrid[tuple(map(slice, img.shape))][2:]
            o[I, J] = i[SH]
            i, o = i.swapaxes(0, 1), o.swapaxes(0, 1)
            y, x = x.swapaxes(0, 1), y.swapaxes(0, 1)
    return out, offsI, offsJ
from scipy.misc import face
f = face()
fr, *fidx = rays(f, (200, 400))
s = np.uint8((np.arange(384)[:, None] % 41 < 2)&(np.arange(512) % 41 < 2))
s = 255*s + 128*s[::-1, ::-1] + 64*s[::-1] + 32*s[:, ::-1]
sr, *sidx = rays(s, (200, 400))
from PIL import Image
Image.fromarray(f).show()
Image.fromarray(fr).show()
Image.fromarray(s).show()
Image.fromarray(sr).show()
I perform the cross product of contiguous segments of a trajectory (xy coordinates) using the following script:
In [129]:
def func1(xy, s):
    size = xy.shape[0]-2*s
    out = np.zeros(size)
    for i in range(size):
        p1, p2 = xy[i], xy[i+s]        # segment 1
        p3, p4 = xy[i+s], xy[i+2*s]    # segment 2
        out[i] = np.cross(p1-p2, p4-p3)
    return out

def func2(xy, s):
    size = xy.shape[0]-2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    tmp1 = p1-p2
    tmp2 = p4-p3
    return tmp1[:, 0] * tmp2[:, 1] - tmp2[:, 0] * tmp1[:, 1]
In [136]:
xy = np.array([[1,2],[2,3],[3,4],[5,6],[7,8],[2,4],[5,2],[9,9],[1,1]])
func2(xy, 2)
Out[136]:
array([ 0, -3, 16, 1, 22])
func1 is particularly slow because of the explicit Python loop, so I rewrote the cross product myself (func2), which is orders of magnitude faster.
Is it possible to use the numpy einsum function to make the same calculation?
einsum computes sums of products only, but you can shoehorn the cross product into a sum of products by reversing the columns of tmp2 and changing the sign of its second column:
def func3(xy, s):
    size = xy.shape[0]-2*s
    tmp1 = xy[0:size] - xy[s:size+s]
    tmp2 = xy[2*s:size+2*s] - xy[s:size+s]
    tmp2 = tmp2[:, ::-1]
    tmp2[:, 1] *= -1
    return np.einsum('ij,ij->i', tmp1, tmp2)
But func3 is slower than func2.
In [80]: xy = np.tile(xy, (1000, 1))
In [104]: %timeit func1(xy, 2)
10 loops, best of 3: 67.5 ms per loop
In [105]: %timeit func2(xy, 2)
10000 loops, best of 3: 73.2 µs per loop
In [106]: %timeit func3(xy, 2)
10000 loops, best of 3: 108 µs per loop
Sanity check:
In [86]: np.allclose(func1(xy, 2), func3(xy, 2))
Out[86]: True
I think the reason func2 beats einsum here is that the cost of setting up einsum's loop for just 2 iterations is too high compared to simply writing out the sum manually, and the reversing and multiplying eat up some time as well.
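If you want to keep everything inside einsum without mutating tmp2 in place, another option (my own variant, not from the answers above) is to fold the sign flip into a small constant matrix, since the 2-D cross product of each row pair is tmp1 @ R @ tmp2 with R = [[0, 1], [-1, 0]]:
def func4(xy, s):
    # hypothetical variant: the 90-degree rotation matrix R absorbs the reverse-and-negate step
    size = xy.shape[0]-2*s
    tmp1 = xy[0:size] - xy[s:size+s]
    tmp2 = xy[2*s:size+2*s] - xy[s:size+s]
    R = np.array([[0, 1], [-1, 0]])
    return np.einsum('ij,jk,ik->i', tmp1, R, tmp2)
It avoids the copy and the in-place sign change, though whether it actually beats func2 would still have to be timed.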
np.cross is a smart little beast that can handle broadcasting without any issue, so you can rewrite your func2 as:
def func2(xy, s):
    size = xy.shape[0]-2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    return np.cross(p1-p2, p4-p3)
and it will produce the correct result:
>>> func2(xy, 2)
array([ 0, -3, 16, 1, 22])
In the latest numpy it will likely run a tad faster than your code, as it was rewritten to minimize intermediate array creation. You can look at the source code (pure Python) here.