Find nearest neighbour in a more pythonic way - python

A is a point, and P is a list of points.
I want to find which point P[i] is the closest to A, i.e. I want to find P[i_0] with:
i_0 = argmin_i ||A - P[i]||^2
I do it this way:
import numpy as np

# P is a list of 4 points
P = [np.array([-1, 0, 7, 3]), np.array([5, -2, 8, 1]), np.array([0, 2, -3, 4]), np.array([-9, 11, 3, 4])]
A = np.array([1, 2, 3, 4])

distance = float('inf')  # +infinity, so the first point always wins the first comparison
closest = None
for p in P:
    delta = sum((p - A)**2)  # squared Euclidean distance
    if delta < distance:
        distance = delta
        closest = p
print(closest)  # the closest point to A among all the points in P
It works, but how to do this in a shorter/more Pythonic way?
More generally in Python (and even without using Numpy), how to find k_0 such that D[k_0] = min D[k]? i.e. k_0 = argmin_k D[k]

A more Pythonic way of implementing the same algorithm you're using is to replace your loop with a call to min with a key function:
closest = min(P, key=lambda p: sum((p - A)**2))
Note that I'm using ** for exponentiation (^ is the binary-xor operator in Python).
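The same min-with-key idiom answers the general argmin question as well; a minimal sketch without NumPy, minimizing over the indices instead of the values:
D = [4.2, 1.5, 3.0]                            # any sequence of comparable values
k_0 = min(range(len(D)), key=lambda k: D[k])   # index of the smallest element
# k_0 == 1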

A fully vectorized approach in NumPy, similar to @MikeMüller's, but using NumPy's broadcasting to avoid the lambda function.
With the example data:
>>> P = [np.array([-1, 0, 7, 3]), np.array([5, -2, 8, 1]), np.array([0, 2, -3, 4]), np.array([-9, 11, 3, 4])]
>>> A = np.array([1, 2, 3, 4])
And making P a 2D numpy array:
>>> P = np.asarray(P)
>>> P
array([[-1,  0,  7,  3],
       [ 5, -2,  8,  1],
       [ 0,  2, -3,  4],
       [-9, 11,  3,  4]])
It can be computed in one line using numpy:
>>> P[np.argmin(np.sum((P - A)**2, axis=1))]
Note that P - A, with P.shape == (N, 4) and A.shape == (4,), will broadcast the subtraction to all the rows of P (each row P[i] becomes P[i] - A).
For small N (the number of rows in P), the Pythonic approach is probably faster; for large N, the vectorized version should be significantly faster.
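As a quick check of the broadcasting step with the example data above:
>>> (P - A).shape                  # A is stretched across all 4 rows of P
(4, 4)
>>> np.sum((P - A)**2, axis=1)     # one squared distance per row
array([ 25,  66,  37, 181])
>>> P[np.argmin(np.sum((P - A)**2, axis=1))]
array([-1,  0,  7,  3])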

A NumPy version as a one-liner:
closest = P[np.argmin(np.apply_along_axis(lambda p: np.sum((p - A)**2), 1, P))]
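An equivalent sketch using np.linalg.norm; the square root does not change which index is the argmin, so both versions pick the same point:
closest = P[np.argmin(np.linalg.norm(np.asarray(P) - A, axis=1))]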

Using the built-in min is the way to go here:
import math
p1 = [1,2]
plst = [[1,3], [10,10], [5,5]]
res = min(plst, key=lambda x: math.sqrt(pow(p1[0]-x[0], 2) + pow(p1[1]-x[1], 2)))
print(res)
[1, 3]
Note that I just used plain python lists.
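On Python 3.8+ the key function can lean on math.dist, which computes the Euclidean distance between two points directly (a sketch; since the square root is monotonic, it could be dropped entirely when only the ordering matters):
from math import dist
res = min(plst, key=lambda x: dist(p1, x))  # math.dist requires Python 3.8+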

Related

Any efficient analogue of argsort for array of indices with NumPy?

I have an array of indices like a = [2, 4, 1, 0, 3] and I want to transform it into np.argsort(a) = [3, 2, 0, 4, 1].
The problem is that argsort has O(n*log(n)) timing, but for my case it may be O(n) and I even have code for this:
b = np.zeros(a.size, dtype=int)  # int dtype, since b holds indices
for i in range(a.size):
    b[a[i]] = i
The second problem is that cycles are slow in Python and I hope that it's possible to use some NumPy tricks to achieve the goal.
Do you have all numbers from 0 to len(a)-1?
Then use smart indexing:
a = [2, 4, 1, 0, 3]
b = np.empty(len(a), dtype=int) # or b = np.empty_like(a)
b[a] = np.arange(len(a))
b
output: array([3, 2, 0, 4, 1])
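As a quick sanity check against the example above, b matches np.argsort and is the inverse permutation of a:
>>> np.array_equal(b, np.argsort(a))
True
>>> b[a]    # composing with a gives the identity permutation
array([0, 1, 2, 3, 4])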

Joint accumulation of addition and multiplication

I have an array:
a = np.array([1, 2, 3, 1, 3, 4, 2, 4])
and I want to do the following calculation:
out = 0
for e in a:
    out *= 3
    out += e
With out as the output (4582 for the given example), is there a nice way to vectorize this? I think einsum could be used, but I couldn't figure out how to write it.
One approach:
import numpy as np
a = np.array([1, 2, 3, 1, 3, 4, 2, 4])
powers = np.multiply.accumulate(np.repeat(3, len(a) - 1))
res = np.sum(powers[::-1] * a[:-1]) + a[-1]
print(res)
Output
4582
If you expand the loop, you'll notice that you are multiplying each value of a by a power of 3 and then summing the result.
Personally I would use reduce (on Python 3, import it from functools):
from functools import reduce
reduce(lambda x, y: x * 3 + y, a)
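The powers of 3 can also be built directly with 3**np.arange, collapsing the accumulate/repeat step into one weighted sum (a sketch; watch for integer overflow with long inputs):
res = np.sum(a * 3**np.arange(len(a) - 1, -1, -1))  # a[i] weighted by 3**(n-1-i)
# 4582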

Average difference between ints in two lists in one line - Python

There are two non-empty lists, containing only ints, both have the same length.
Our function needs to return the average absolute difference between the ints at the same index.
For example, for the lists [1, 2, 3, 4] and [1, 1, 1, 1], the answer will be 1.5.
The function needs to be completed in one line.
I had a little something that does that, but as you can probably guess, it's not a one-liner:
def avg_diff(a, b):
    sd = 0.0
    for x, y in zip(a, b):
        sd += abs(x - y)
    return sd / len(a)
Thanks.
In Python 3.4 we got the statistics module in the standard library, including statistics.mean.
Using this function and a generator-expression:
from statistics import mean
a = [1, 2, 3, 4]
b = [1, 1, 1, 1]
mean(abs(x - y) for x, y in zip(a, b))
# 1.5
a = [1, 2, 3, 4]
b = [1, 1, 1, 1]
sum([abs(i - j) for i, j in zip(a,b)]) / float(len(a))
If you are happy to use a 3rd party library, numpy provides one way:
import numpy as np
A = np.array([1, 2, 3, 4])
B = np.array([1, 1, 1, 1])
res = np.mean(np.abs(A - B))
# 1.5
Using the built-in sum and len functions on lists:
lst1 = [1, 2, 3, 4]
lst2 = [1, 1, 1, 1]
diff = [abs(x-y) for x, y in zip(lst1, lst2)] # find index-wise differences
print(sum(diff)/len(diff)) # divide sum of differences by total
# 1.5
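Wrapped up as the one-line function the question asks for, using only built-ins (Python 3 division):
def avg_diff(a, b): return sum(abs(x - y) for x, y in zip(a, b)) / len(a)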

How to get lists of indices to unique values efficiently?

Is there a built-in method that would help me achieve the following efficiently: given an array, I need a list of arrays, each with indices to a different unique value of the array?
If f is the desired function, with
b = f(a)
and
u, idxs = np.unique(a, return_inverse=True)
then
b[i] == np.where(idxs == i)[0]
I am aware that pandas.Series.groupby() can do this, but it may not be efficient to create a dict when there are over 10^5 unique integers.
If you have numpy >= 1.9 you can do:
>>> a = np.random.randint(5, size=10)
>>> a
array([0, 2, 4, 4, 2, 4, 4, 3, 2, 1])
>>> unq, unq_inv, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
>>> np.split(np.argsort(unq_inv), np.cumsum(unq_cnt[:-1]))
[array([0]), array([9]), array([1, 4, 8]), array([7]), array([2, 3, 5, 6])]
>>> unq
array([0, 1, 2, 3, 4])
In earlier versions, you can get the counts doing an extra:
>>> unq_cnt = np.bincount(unq_inv)
Also, if you want to make sure that the indices for each value are sorted, I think you will need to use a stable sort, e.g. np.argsort(unq_inv, kind='mergesort')
Thinking about what you seem to be after, which I think is minimizing calls to an expensive function, I don't think you need to do what you are asking. Say that your function was squaring, you could simply do:
>>> unq, unq_inv = np.unique(a, return_inverse=True)
>>> f_unq = unq**2
>>> f_a = f_unq[unq_inv]
>>> a
array([0, 2, 4, 4, 2, 4, 4, 3, 2, 1])
>>> f_a
array([ 0, 4, 16, 16, 4, 16, 16, 9, 4, 1])
def foo(a):
    I = np.arange(a.shape[0])
    d = {}
    while a.shape[0]:
        x = a[0]
        ii = a == x    # boolean mask of the entries equal to x
        d[x] = I[ii]   # record their original indices
        a = a[~ii]     # shrink the search arrays
        I = I[~ii]
    return d
In [767]: a
Out[767]: array([4, 4, 3, 0, 0, 2, 1, 1, 0, 3])
In [768]: foo(a)
Out[768]:
{0: array([3, 4, 8]),
 1: array([6, 7]),
 2: array([5]),
 3: array([2, 9]),
 4: array([0, 1])}
Is this the sort of dictionary that you want?
For small a this works fine.
An equivalent dictionary building function is:
def foo1(a):
    unq = np.unique(a)
    return {i: np.where(a == i)[0] for i in unq}
Offhand I don't see how unq_inv helps with building the dictionary.
foo is about 30% slower than foo1. I was hoping that by shrinking the searched array each time a value was processed I might gain some speed, but it looks like the extra bookkeeping chews up the time, and the where time may not be that sensitive to the length of a.
For a2=np.random.randint(5000,size=100000) run times are on the order of 2-3 sec.
But np.random.randint(50000,size=1000000) takes too long to time (for either version).
On further experimentation, a 'dumb' approach using collections.defaultdict is much faster (about 20x):
from collections import defaultdict

def food(a):
    d = defaultdict(list)
    for i, j in enumerate(a):
        d[j].append(i)
    return d
The 'too big' (1000000,) array takes only 1.1 seconds.
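A quick usage example on the same small array as before (passed as a plain list so the keys print as plain ints):
>>> food([4, 4, 3, 0, 0, 2, 1, 1, 0, 3])
defaultdict(<class 'list'>, {4: [0, 1], 3: [2, 9], 0: [3, 4, 8], 2: [5], 1: [6, 7]})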
Maybe do something like:
s = np.argsort(a, kind='mergesort')          # stable, so ties keep their original order
boundaries = np.where(np.diff(a[s]))[0] + 1  # positions where the sorted value changes
groups = np.split(s, boundaries)             # one index array per unique value
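Checked against the same example array, this produces the same groups as foo; with the stable sort, indices within each group keep their original order:
>>> a = np.array([4, 4, 3, 0, 0, 2, 1, 1, 0, 3])
>>> s = np.argsort(a, kind='mergesort')
>>> [g.tolist() for g in np.split(s, np.where(np.diff(a[s]))[0] + 1)]
[[3, 4, 8], [6, 7], [5], [2, 9], [0, 1]]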

How to update a Numpy array sequentially without loop

I have a Numpy array v and I want to update each element using a function on the current element of the array :
v[i] = f(v, i)
A basic way to do this is with a loop:
for i in range(2, len(v)):
    v[i] = f(v, i)
Hence the values used to update v[i] come from the already-updated array v. Is there a way to do these updates without a loop?
For example,
v = [f(v, i) for i in range(len(v))]
does not work, since v[i-1] has not yet been updated when it is used inside the list comprehension.
The function f can depend on several elements of the list: those with an index lower than i have already been updated, while those with an index greater than i have not, as in the following example:
v = [1, 2, 3, 4, 5]
f = lambda v, i: (v[i-1] + v[i]) / v[i+1]  # for i in {1, 2, 3}
f = lambda v, i: v[i]                      # for i in {0, 4}
it should return
v = [1, (1+2)/3, (1+4)/4, ((5/4)+4)/5, 5]
There is a function for this:
import numpy
v = numpy.array([1, 2, 3, 4, 5])
numpy.add.accumulate(v)
#>>> array([ 1, 3, 6, 10, 15])
This works on many different types of ufunc:
numpy.multiply.accumulate(v)
#>>> array([ 1, 2, 6, 24, 120])
For an arbitrary function doing this kind of accumulation, you can make your own ufunc, although this will be much slower:
myfunc = numpy.frompyfunc(lambda x, y: x + y, 2, 1)
myfunc.accumulate([1, 2, 3], dtype=object)
#>>> array([1, 3, 6], dtype=object)
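For instance, the multiply-and-add recurrence from the joint-accumulation question above fits this pattern (a sketch, again using dtype=object):
step = numpy.frompyfunc(lambda acc, x: acc * 3 + x, 2, 1)
step.accumulate([1, 2, 3, 1, 3, 4, 2, 4], dtype=object)[-1]
#>>> 4582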
You can use the built-in sum to add up the numbers before v[i]:
>>> v = [v[i] + sum(v[:i]) for i in range(len(v))]
>>> v
[1, 3, 6, 10, 15]
or, better, use np.cumsum():
>>> np.cumsum(v)
array([ 1, 3, 6, 10, 15])
